CHW-934: AFF A400 does not detect partner, causing failover
Issue
- Unexpected node reboot with no explicit ONTAP EMS messages showing the reason
- The partner initiates a takeover with a message similar to:
[[node_name-01: cf_main: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(node_name-02), system_down because controller_inaccessible.]
- No BMC console logs showing the reason for the node going down.
- No BMC events showing the reason for the node going down.
- On the surviving partner node, just before the takeover (TO), the cluster and interconnect ports go down:
[[node_name-01: kernel: netif.linkDown:info]: Ethernet e0a: Link down, check cable.]
[[node_name-01: intr: rlib.ifconfig.linkEvent:notice]: params: {'ifname': 'e0a', 'eventType': 'DOWN'}]
[[node_name-01: kernel: netif.linkDown:info]: Ethernet e0b: Link down, check cable.]
[[node_name-01: intr: rlib.ifconfig.linkEvent:notice]: params: {'ifname': 'e0b', 'eventType': 'DOWN'}]
[[node_name-01: mcc_cfd_rnic: mirror.stream.qp.error:debug]: params: {'mirror': 'HA Partner', 'qp_name': 'RAID', 'error': 'NVMM_ERR_POLL'}]
[[node_name-01: mcc_cfd_rnic: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_ONLINE is aborted because of reason NVMM_ERR_POLL.]
[[node_name-01: mcc_cfd_rnic: mirror.stream.qp.error:debug]: params: {'mirror': 'HA Partner', 'qp_name': 'MISC', 'error': 'NVMM_ERR_POLL'}]
[[node_name-01: nvmm_error: nvmm.mirror.offlined:debug]: params: {'mirror': 'HA_PARTNER'}]
[[node_name-01: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node_name-02 by node_name-01 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).]
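To check a saved EMS log excerpt against the symptoms listed above, the event names can be matched with a short script. The following is a minimal sketch, not a NetApp tool: it assumes each line keeps the "[node: source: event.name:severity]:" header exactly as shown in the excerpts above, and the helper names (find_signature_events, matches_signature) are hypothetical. It does not check the order of events or the BMC logs.

```python
import re

# Event names taken from the EMS excerpts in this article.
SIGNATURE_EVENTS = (
    "netif.linkDown",
    "nvmm.mirror.aborting",
    "cf.fsm.takeoverByPartnerDisabled",
    "cf.hwassist.takeoverTrapRecv",
)

# Assumed header layout: "[node: source: event.name:severity]:"
HEADER = re.compile(
    r"\[(?P<node>[^:\[\]]+): (?P<source>[^:]+): "
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]:"
)


def find_signature_events(lines):
    """Return (node, event) pairs for lines carrying one of the signature events."""
    found = []
    for line in lines:
        match = HEADER.search(line)
        if match and match.group("event") in SIGNATURE_EVENTS:
            found.append((match.group("node"), match.group("event")))
    return found


def matches_signature(lines):
    """True only if every signature event appears at least once (order is not checked)."""
    seen = {event for _, event in find_signature_events(lines)}
    return set(SIGNATURE_EVENTS) <= seen
```

Calling matches_signature on the lines of an exported log returns True only when all four events are present. A match is a hint to compare against this article, not a confirmed diagnosis.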