AFF-A400: Repeated unexpected takeovers
Applies to
- AFF-A400
- FAS8300
- ONTAP 9
- Switchless cluster
Issue
- The system experienced repeated, unexplained controller takeovers in an AFF-A400 HA pair
- The takeovers were consistently triggered by simultaneous link-down events on the cluster interconnect ports (e3a/e3b) and HA interconnect ports (e0a/e0b), resulting in loss of heartbeat between nodes and automatic failover
Example:
Sat May 24 09:55:24 [node-01: kernel: netif.linkDown:info]: Ethernet e0a: Link down, check cable.
Sat May 24 09:55:24 [node-01: intr: rlib.ifconfig.linkEvent:notice]: params: {'ifname': 'e0a', 'eventType': 'DOWN'}
Sat May 24 09:55:24 [node-01: mcc_cfd_rnic: mirror.stream.qp.error:debug]: params: {'mirror': 'HA Partner', 'qp_name': 'RAID', 'error': 'NVMM_ERR_POLL'}
Sat May 24 09:55:24 [node-01: mcc_cfd_rnic: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_ONLINE is aborted because of reason NVMM_ERR_POLL.
Sat May 24 09:55:24 [node-01: mcc_cfd_rnic: mirror.stream.qp.error:debug]: params: {'mirror': 'HA Partner', 'qp_name': 'MISC', 'error': 'NVMM_ERR_POLL'}
Sat May 24 09:55:24 [node-01: kernel: netif.linkDown:info]: Ethernet e0b: Link down, check cable.
Sat May 24 09:55:24 [node-01: intr: rlib.ifconfig.linkEvent:notice]: params: {'ifname': 'e0b', 'eventType': 'DOWN'}
Sat May 24 09:55:24 [node-01: nvmm_error: ems.engine.suppressed:debug]: Event 'rdma.rlib.event.error' suppressed 5 times in last 89403 seconds.
Sat May 24 09:55:24 [node-01: nvmm_error: rdma.rlib.event.error:debug]: QP wafl event error: client disconnect.
Sat May 24 09:55:24 [node-01: nvmm_error: nvmm.mirror.offlined:debug]: params: {'mirror': 'HA_PARTNER'}
Sat May 24 09:55:24 [node-01: rastrace_dump: rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20250524_09:55:24:830194.dmp.
Sat May 24 09:55:25 [node-01: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node-01 by node-02 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
Sat May 24 09:55:25 [node-01: cf_firmware: cf.fm.partnerFwTransition:info]: params: {'prevstate': 'SF_UP', 'newstate': 'SF_UNKNOWN', 'progresscounter': '0'}
Sat May 24 09:55:26 [node-01: kernel: netif.linkDown:info]: Ethernet e3a: Link down, check cable.
Sat May 24 09:55:26 [node-01: vifmgr: vifmgr.portdown:notice]: A link down event was received on node node-01, port e3a.
Sat May 24 09:55:26 [node-01: vifmgr: vifmgr.clus.linkdown:EMERGENCY]: The cluster port e3a on node node-01 has gone down unexpectedly.
Sat May 24 09:55:26 [node-01: vifmgr: vifmgr.lifmoved.linkdown:notice]: LIF node-01_clus1 (on virtual server xxxxxxxxxx), IP address 169.254.125.245, is being moved to node node-01, port e3b.
Sat May 24 09:55:26 [node-01: kernel: netif.linkDown:info]: Ethernet e3b: Link down, check cable.
Sat May 24 09:55:27 [node-01: nvmm_mirror_sync: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_LAYOUT_SYNCING is aborted because of reason NVPM_ERR_MSG_SEND_FAILED.
Sat May 24 09:55:27 [node-01: nvmm_error: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state NVMM_MIRROR_OFFLINE is aborted because of reason NVMM_ABORT_SYNCING_MIRROR.
Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=0b063568234d2002) with Tag CTRAN: Connection from local lif 1024 to remote IP 169.254.128.145 is closed.
Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=0b063568234d2002) with Tag CTRAN: Connection from local lif 1023 to remote IP 169.254.210.21 is closed.
Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=10063568233579a4) with Tag CTRAN_SAN: Connection from local lif 1024 to remote IP 169.254.128.145 is closed.
Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=10063568233579a4) with Tag CTRAN_SAN: Connection from local lif 1023 to remote IP 169.254.210.21 is closed.
Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=0f063568233579a4) with Tag CTRAN_BULK: Connection from local lif 1024 to remote IP 169.254.128.145 is closed.
Sat May 24 09:55:28 [node-01: CsmMpAgentThread: csm.connClose:debug]: Session (req=node-01:dblade, rsp=node-02:dblade, uniquifier=0f063568233579a4) with Tag CTRAN_BULK: Connection from local lif 1023 to remote IP 169.254.210.21 is closed.
Sat May 24 09:55:29 [node-01: kltp: clam.heartbeat.state.change:info]: Heartbeats to node (name=node-02, ID=1001) are Failing.
Sat May 24 09:55:29 [node-01: cf_main: cf.fsm.partnerNotResponding:notice]: Failover monitor: partner not responding
Sat May 24 09:55:29 [node-01: cf_main: fm_lastHeartbeatInfo_1:debug]: params: {'current_time': '5692845290', 'time_since_htbt_read_upd_seq_ic': '5000', 'time_since_htbt_read_upd_seq_mb': '4974', 'time_since_htbt_read_success_ic': '5000', 'time_since_htbt_read_success_mb': '4974', 'time_since_htbt_read_attempt_ic': '1000', 'time_since_htbt_read_attempt_mb': '4974', 'time_since_htbt_write_ic': '0', 'time_since_htbt_write_mb': '16', 'time_since_ic_upd_major_seq': '484080000', 'time_since_mb_upd_major_seq': '484077516', 'time_since_ic_upd_minor_seq': '0', 'time_since_mb_upd_minor_seq': '16', 'time_since_firmware_written': '0', 'time_since_firmware_read': '4980', 'time_since_firmware_rcvd': '5000', 'time_since_first_htbt_write_mb_drop': '5692767954', 'time_since_recent_htbt_write_mb_drop': '5692767954', 'mb_htbt_drop_count': '1', 'partner_minor_seq_num_ic': '7899824', 'partner_major_seq_num_ic': '1747558659', 'partner_minor_seq_num_mb': '7899823', 'partner_major_seq_num_mb': '1747558659'}
Sat May 24 09:55:29 [node-01: cf_main: cf.fsm.takeoverCountdown:info]: Failover monitor: takeover scheduled in 10 seconds
Sat May 24 09:55:29 [node-01: vifmgr: vifmgr.lifmoved.linkdown:notice]: LIF node-01_clus1 (on virtual server xxxxxxxxxx), IP address 169.254.125.245, is being moved to node node-01, port e3a.
Sat May 24 09:55:29 [node-01: vifmgr: vifmgr.lifmoved.linkdown:notice]: LIF node-01_clus2 (on virtual server xxxxxxxxxx), IP address 169.254.211.94, is being moved to node node-01, port e3b.
Sat May 24 09:55:29 [node-01: vifmgr: vifmgr.cluscheck.notassigned:alert]: Cluster LIF node-01_clus2 (node node-01) is not assigned to any port.
Sat May 24 09:55:29 [node-01: vifmgr: callhome.clus.net.degraded:alert]: Call home for CLUSTER NETWORK DEGRADED: Cluster LIF Not Assigned to Any Port - Cluster LIF node-01_clus2 (node node-01) is not assigned to any port
Sat May 24 09:55:29 [node-01: vifmgr: vifmgr.cluscheck.notassigned:alert]: Cluster LIF node-01_clus1 (node node-01) is not assigned to any port.
Sat May 24 09:55:39 [node-01: cf_main: cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
Sat May 24 09:55:39 [node-01: cf_main: cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
Sat May 24 09:55:39 [node-01: cf_takeover: ha.takeover.stateChng:debug]: params: {'old_state': 'NOT_IN_TAKEOVER', 'new_state': 'IN_CFO_TAKEOVER'}
Sat May 24 09:55:39 [node-01: cf_takeover: cf.fm.takeoverStarted:notice]: Failover monitor: takeover started
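When reviewing a long EMS excerpt like the one above, it can help to confirm programmatically that the cluster ports (e3a/e3b) and HA ports (e0a/e0b) dropped within the same few seconds. The sketch below is a hypothetical helper, not an ONTAP tool: it parses `netif.linkDown` entries from saved log text and groups them into time-clustered bursts. The regex and 5-second window are assumptions based on the log format shown here.

```python
import re
from datetime import datetime

# Matches EMS lines such as:
# Sat May 24 09:55:24 [node-01: kernel: netif.linkDown:info]: Ethernet e0a: Link down, check cable.
LINKDOWN_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} \d+ \d{2}:\d{2}:\d{2}) "
    r"\[(?P<node>[^:]+): kernel: netif\.linkDown:info\]: "
    r"Ethernet (?P<port>\S+): Link down"
)

def linkdown_bursts(lines, window_s=5):
    """Group (timestamp, port) link-down events separated by <= window_s seconds."""
    events = []
    for line in lines:
        m = LINKDOWN_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S")
            events.append((ts, m.group("port")))
    bursts, current = [], []
    for ts, port in sorted(events):
        # Start a new burst when the gap to the previous event exceeds the window.
        if current and (ts - current[-1][0]).total_seconds() > window_s:
            bursts.append(current)
            current = []
        current.append((ts, port))
    if current:
        bursts.append(current)
    return bursts

sample = [
    "Sat May 24 09:55:24 [node-01: kernel: netif.linkDown:info]: Ethernet e0a: Link down, check cable.",
    "Sat May 24 09:55:24 [node-01: kernel: netif.linkDown:info]: Ethernet e0b: Link down, check cable.",
    "Sat May 24 09:55:26 [node-01: kernel: netif.linkDown:info]: Ethernet e3a: Link down, check cable.",
    "Sat May 24 09:55:26 [node-01: kernel: netif.linkDown:info]: Ethernet e3b: Link down, check cable.",
]
for burst in linkdown_bursts(sample):
    print([port for _, port in burst])
```

A single burst containing all four ports, as in the sample, is consistent with a shared underlying cause (for example an interconnect card or host-side fault) rather than independent cabling problems on each port.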
- Upgrading the BMC firmware to the latest version (13.12) did not resolve the issue
Example:
Service Processor Status: Online
Firmware Version: 13.12
- The BMC debug log recorded repeated, unexpected "LPC Reset Occurred" warning events
Example:
BMC node-02> bmc log debug
### /var/log/warning.log: ###
Dec 30 04:31:40 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 04:31:47 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 04:31:51 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 04:31:52 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 06:43:11 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 06:43:18 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 06:43:22 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 06:43:23 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 06:46:41 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 06:46:48 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 06:46:52 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 06:46:53 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 07:04:11 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 07:04:18 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 07:04:22 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Dec 30 07:04:23 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
Jan 3 09:01:22 IPMIMain: [687 : 710 WARNING][BTIfc.c:73] LPC Reset Occurred -
Jan 3 09:01:29 IPMIMain: [687 : 710 WARNING][BTIfc.c:73] LPC Reset Occurred -
Jan 3 09:01:33 IPMIMain: [687 : 710 WARNING][BTIfc.c:73] LPC Reset Occurred -
Jan 3 09:01:34 IPMIMain: [687 : 710 WARNING][BTIfc.c:73] LPC Reset Occurred -
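To see whether the LPC resets cluster around particular dates (and can be compared against the takeover times), the saved `bmc log debug` output can be tallied per day. The following sketch is a hypothetical helper based only on the log format shown above; the regex is an assumption, not part of any BMC tooling.

```python
import re
from collections import Counter

# Matches BMC warning-log lines such as:
# Dec 30 04:31:40 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -
LPC_RE = re.compile(r"^(?P<day>\w{3}\s+\d+) \d{2}:\d{2}:\d{2} .*LPC Reset Occurred")

def lpc_resets_per_day(lines):
    """Count 'LPC Reset Occurred' warnings per calendar day (e.g. 'Dec 30')."""
    counts = Counter()
    for line in lines:
        m = LPC_RE.match(line)
        if m:
            # Normalize whitespace so 'Jan  3' and 'Jan 3' tally together.
            counts[" ".join(m.group("day").split())] += 1
    return dict(counts)

sample = [
    "Dec 30 04:31:40 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -",
    "Dec 30 04:31:47 IPMIMain: [13310 : 13327 WARNING][BTIfc.c:73] LPC Reset Occurred -",
    "Jan 3 09:01:22 IPMIMain: [687 : 710 WARNING][BTIfc.c:73] LPC Reset Occurred -",
]
print(lpc_resets_per_day(sample))
```

Note that the BMC warning log does not record a year, so day labels alone cannot distinguish events from different years; correlate against the EMS timeline manually.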
