NetApp Knowledge Base

What is the workflow and time delay of the failover and failback process in an ONTAP Select (OTS) two-node cluster?

Category:
ontap-select
Specialty:
virt

Applies to

  • ONTAP Select (OTS)
  • ONTAP Select Deploy
  • SyncMirror

Answer

By default, in an active/active HA pair, if one node panics, reboots, or halts, the partner node automatically takes over and then returns storage when the affected node reboots.

The HA pair then resumes a normal operating state. Automatic takeovers may also occur if one of the nodes becomes unresponsive.

When a node takes over its partner, it continues to serve and update data in the partner's aggregates and volumes. To do this, it takes ownership of the partner's aggregates, and the partner's LIFs migrate according to network interface failover rules.
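
The takeover and giveback behavior described above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not how ONTAP implements it: the `Node` class, the `takeover` and `giveback` functions, and the node, aggregate, and LIF names are all hypothetical, and every LIF is assumed to migrate to the single partner node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one node of a two-node HA pair."""
    name: str
    aggregates: set = field(default_factory=set)
    lifs: set = field(default_factory=set)


def takeover(survivor, failed):
    """Survivor takes ownership of the partner's aggregates and hosts its LIFs."""
    taken_aggregates = set(failed.aggregates)
    survivor.aggregates |= failed.aggregates
    # Simplification: with only one partner, all LIFs land on the survivor.
    survivor.lifs |= failed.lifs
    failed.aggregates.clear()
    failed.lifs.clear()
    return taken_aggregates


def giveback(survivor, recovered, taken_aggregates):
    """Return the partner's storage once it has rebooted.

    LIF placement after giveback depends on the configured failover rules
    and is deliberately not modeled here.
    """
    survivor.aggregates -= taken_aggregates
    recovered.aggregates |= taken_aggregates


node_a = Node("node-a", {"aggr_a"}, {"lif_a"})
node_b = Node("node-b", {"aggr_b"}, {"lif_b"})

taken = takeover(node_a, node_b)   # node-b fails; node-a serves both aggregates
giveback(node_a, node_b, taken)    # node-b is back; its aggregate is returned
```

The model only tracks ownership. It says nothing about how long each step takes, which is covered in the next section.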

Average Time Delay

The average time delay for the failover and failback process in an OTS two-node cluster varies with the number and size of the aggregates. As a rough breakdown: approximately 1 minute to transfer compute and protocol, no more than 2 minutes to transfer networking, and up to 5 minutes to transfer storage (depending on the number and size of aggregates).
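
As a worked example, the per-phase figures above can be added up to give a rough planning budget. The sketch assumes the three phases run back to back and do not overlap, which this article does not state; the names below are illustrative only.

```python
# Approximate upper bound per phase, in minutes, taken from the figures above.
PHASE_UPPER_BOUND_MIN = {
    "compute_and_protocol": 1,
    "networking": 2,
    "storage": 5,  # depends on the number and size of aggregates
}


def worst_case_minutes(phases=PHASE_UPPER_BOUND_MIN):
    """Sum the per-phase bounds, assuming the phases run sequentially."""
    return sum(phases.values())


print(worst_case_minutes())  # 8
```

Under that assumption, a planning budget of about 8 minutes covers the approximate upper bound of each phase. Actual times depend on the environment.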

NetApp uses RAID SyncMirror (RSM), already present in ONTAP, to replicate data blocks between the cluster nodes for the HA functionality.

  • SyncMirror in OTS uses the dedicated Cluster network (Internal Port Group) and NVRAM to replicate disk writes between local Pool0 and remote Pool1 on OTS nodes.
  • Disk reads do not need to sync, as they are served directly from the local Pool0.
  • For additional details, see HA RSM and mirrored aggregates.
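
The write and read paths described in the list above can be pictured with a small toy model. It is a conceptual sketch only: the `MirroredAggregate` class and its methods are hypothetical, and NVRAM, the cluster network, and failure handling are not modeled.

```python
class MirroredAggregate:
    """Hypothetical model of a mirrored aggregate on one OTS node."""

    def __init__(self):
        self.pool0 = {}  # local copy on this node
        self.pool1 = {}  # remote copy on the partner node

    def write(self, block_id, data):
        # Every disk write is applied to both the local and the remote pool.
        self.pool0[block_id] = data
        self.pool1[block_id] = data

    def read(self, block_id):
        # Reads are served from the local pool only; nothing is synced.
        return self.pool0[block_id]


aggr = MirroredAggregate()
aggr.write(7, b"payload")
print(aggr.read(7) == aggr.pool1[7])  # True
```

Because both pools hold the same blocks, the partner already has a current copy of the data when it takes over.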

Scenarios that can affect takeover/giveback (TO/GB):

  • Power outages:
    • If both nodes or both ESXi hosts go down, the result is a "dirty shutdown", which can cause lost writes and WAFL file system corruption.
    • OTS is software-defined storage and is 100% dependent on the health of the VMware/ESXi hosts.
  • Network disruptions will affect OTS no differently than any other solution.
    • NAS protocols will have timeouts and lost access unless suitable redundancy is configured in the network (redundant switches, port channels, port trunking).
      • OTS can leverage LIF failover to the partner node if the network disruption affects only a single ESXi host.
    • SAN protocols in OTS use iSCSI.
      • The same rules for NAS apply to iSCSI.
      • Resiliency depends on the level of redundancy built into the network hardware and ESXi configuration.
      • Keep in mind that ONTAP Select is software-defined storage. At a high level, the difference is that you supply the compute and storage hardware.

For more information on ONTAP Select architecture and best practice configuration, refer to Technical Report TR-4517.


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.