CVO Node Reboots and HA Takeover Is Impossible Due to Azure VNet Encryption Networking Issue
Applies to
- Cloud Volumes ONTAP (CVO) in Microsoft Azure
- ONTAP 9
Issue
- A Cloud Volumes ONTAP HA cluster in Azure experienced a full outage: one node rebooted unexpectedly, and its partner node was unable to take over. Both nodes later returned to a healthy state, but the following messages were logged during the incident:
[cluster-02:cf_main:callhome.partner.down:EMERGENCY]: Callhome for PARTNER DOWN, TAKEOVER IMPOSSIBLE
[cluster-02:cf_main:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of cluster-01 disabled (unsynchronized log).
[cluster-02:cf_main:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of cluster-01 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
[cluster-01:nlbd:vsa.azure.nlb.probeInactive:alert]: Failed to receive Load Balancer probe (now inactive) for 2 ports (port range: 63001 to 63010), within 15 seconds.
[cluster-01:mgwd:dns.server.timed.out:error]: DNS server 10.0.0.25 did not respond to vserver=snm_name08 within timeout interval.
[cluster-01:vifmgr:vifmgr.cluscheck.droppedall:alert]: Total packet loss when pinging from cluster lif cluster-01_clus_1 (node cluster-01) to cluster lif cluster-02_clus_2 (node cluster-02).
[cluster-01:raid.vol.reparity.issue:notice]: Aggregate aggr1_1801 has invalid NVRAM contents.
[cluster-01:nv.data.loss.possible:notice]: An unexpected shutdown occurred while in high write speed mode, which possibly caused a loss of data.
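To check an exported event log for the same signature, the messages above can be scanned with a short script. The following is a minimal Python sketch, not a NetApp tool. It assumes each message starts with a bracketed header whose first field is the node, whose last field is the severity, and whose second-to-last field is the event name. That layout is inferred only from the sample messages in this article. The function names parse_ems_line and find_takeover_events, and the choice of which events to flag, are hypothetical.

```python
import re

# Event names copied from the messages in this article; the selection of
# which ones to flag is an assumption made for illustration.
EVENTS_OF_INTEREST = {
    "callhome.partner.down",
    "cf.fsm.takeoverOfPartnerDisabled",
    "vsa.azure.nlb.probeInactive",
    "vifmgr.cluscheck.droppedall",
    "nv.data.loss.possible",
}

# Severity words accepted as the last header field (assumed list).
SEVERITIES = {"emergency", "alert", "error", "notice", "informational", "debug"}

# Matches "[header]: message"; the header layout is inferred from the samples.
LINE_RE = re.compile(r"\[(?P<header>[^\]]+)\]:\s*(?P<message>.*)")


def parse_ems_line(line):
    """Return a dict with node, event, severity, and message, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = match.group("header").split(":")
    # Some sample headers omit the process name, so accept 3 or more fields.
    if len(fields) < 3 or fields[-1].lower() not in SEVERITIES:
        return None
    return {
        "node": fields[0],
        "event": fields[-2],
        "severity": fields[-1].lower(),
        "message": match.group("message").strip(),
    }


def find_takeover_events(lines):
    """Return parsed records whose event name is in EVENTS_OF_INTEREST."""
    hits = []
    for line in lines:
        record = parse_ems_line(line)
        if record is not None and record["event"] in EVENTS_OF_INTEREST:
            hits.append(record)
    return hits
```

Passing the lines of a saved log file to find_takeover_events returns one record per matching message, which makes it easier to see which node reported each event.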