What is the workflow and time delay of the failover and failback process in an OTS two-node cluster?
Applies to
- ONTAP Select (OTS)
- ONTAP Select Deploy
- SyncMirror
Answer
By default, in an Active/Active HA pair, if one node panics, reboots, or halts, the partner node automatically takes over and then returns storage when the affected node reboots.
The HA pair then resumes a normal operating state. Automatic takeovers may also occur if one of the nodes becomes unresponsive.
When a node takes over its partner, it continues to serve and update data in the partner's aggregates and volumes. To do this, it takes ownership of the partner's aggregates, and the partner's LIFs migrate according to network interface failover rules.
Average Time Delay
The average time delay for the failover and failback process in an OTS two-node cluster varies with the number and size of the aggregates. As a rough breakdown, it takes approximately 1 minute to transfer compute and protocol, then no more than 2 minutes to transfer networking, and up to 5 minutes to transfer storage (depending on the number and size of the aggregates).
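Taking the figures above as upper bounds, the worst case can be tallied as in the sketch below. It assumes the unlabeled figures are in minutes and that the three phases run one after another without overlapping, neither of which is stated explicitly here. The names `PHASE_MAX_MINUTES` and `worst_case_minutes` are illustrative only.

```python
# Approximate upper bounds per phase, taken from the breakdown above.
# Assumption: all values are minutes and the phases run sequentially.
PHASE_MAX_MINUTES = {
    "compute and protocol": 1,
    "networking": 2,
    "storage": 5,  # depends on number and size of aggregates
}


def worst_case_minutes(phases=PHASE_MAX_MINUTES):
    """Sum the per-phase upper bounds into one rough worst-case estimate."""
    return sum(phases.values())


if __name__ == "__main__":
    print(f"Rough worst case: {worst_case_minutes()} minutes")
```

Treat the result as a planning estimate for client timeouts and maintenance windows, not as a guaranteed limit.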
NetApp uses RAID SyncMirror (RSM), already present in ONTAP, to replicate data blocks between the cluster nodes for the HA functionality.
- SyncMirror in OTS uses the dedicated Cluster network (Internal Port Group) and NVRAM to replicate disk writes between local Pool0 and remote Pool1 on OTS nodes.
- Disk reads do not need to sync as they are read directly from the local Pool0
- HA RSM and mirrored aggregates additional details.
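The read and write paths described above can be pictured with a toy model. This is a conceptual sketch only, not ONTAP code: the class `MirroredAggregate` and its attributes are hypothetical, and the cluster network and NVRAM are not modeled. It shows only that a write lands in both pools while a read touches the local pool alone.

```python
class MirroredAggregate:
    """Toy model of a mirrored aggregate with a local and a remote copy."""

    def __init__(self):
        self.pool0 = {}  # local copy on this node
        self.pool1 = {}  # remote copy on the partner node

    def write(self, block_id, data):
        # A write is replicated to both pools (simplified: done in one step).
        self.pool0[block_id] = data
        self.pool1[block_id] = data

    def read(self, block_id):
        # Reads are served from the local pool only; no sync is needed.
        return self.pool0[block_id]


if __name__ == "__main__":
    agg = MirroredAggregate()
    agg.write(7, b"payload")
    print(agg.read(7), agg.pool1[7])
```

Because every write exists in both pools, the partner node already holds a current copy of the data when it takes over.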
Scenarios that can affect takeover/giveback (TO/GB):
- Power outages:
- If both nodes or ESXi hosts go down, the result is a "dirty shutdown", which can cause lost writes and WAFL file system corruption.
- OTS is software-defined storage and 100% dependent upon the health of the VMware/ESXi hosts.
- Network disruptions will affect OTS no differently than any other solution.
- NAS protocols will have timeouts and lost access unless suitable redundancy is configured in the network (redundant switches, port channels, port trunking).
- OTS can leverage LIF failover to the partner node if the network disruption affects only a single ESX node.
- SAN protocols in OTS use iSCSI.
- The same rules for NAS apply to iSCSI.
- Resiliency depends on the level of redundancy built into the network hardware and ESX configuration.
- Keep in mind that ONTAP Select is software-defined storage. At a high level, the difference is that you supply the compute and storage hardware.
For more information on ONTAP Select architecture and best-practice configuration, refer to Technical Report TR-4517.