Unexpected A700s reboot on BMC 1.81 or higher
- Category:
- aff-series
- Specialty:
- hw
- Last Updated:
- 5/24/2024, 4:56:44 PM
Applies to
- AFF A700s
- BMC 1.81 or higher
Issue
- AFF A700s node reboots unexpectedly.
- The Service Processor resets the node and the partner takes over:
[node_name_2: cf_hwassist: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(node_name_1), system_down because reset_via_sp.
[node_name_2: cf_hwassist: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(node_name_1), system_down because l2_watchdog_reset.
[node_name_2: swi1: mri_ha: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_ONLINE is aborted because of reason Abort Pending.
[node_name_2: gop_eq_thread: ic.linkStatusChange:info]: HA interconnect: Port ic1a link is down.
[node_name_2: cf_fastTimeout: cf.ic.heartBeatFailed:error]: HA interconnect: Heartbeat failed.
[node_name_2: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node_name_2 by node_name_1 disabled (unsynchronized log).
[node_name_2: rastrace_dump: rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20201027_17:15:50:245981.dmp.
[node_name_2: ctrl_hb_port_ic1a: ctrl.rdma.heartBeat:info]: HA interconnect: Missed heartbeat to 192.0.1.4.
[node_name_2: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node_name_2 by node_name_1 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
- The impacted node reboots and may operate normally after the giveback that follows the watchdog reset.
- The BMC SEL log shows NMI and watchdog events:
420 | 03/03/2023 | 17:51:10 | CriticalInt | Software NMI | Asserted
421 | 03/03/2023 | 17:51:10 | Watchdog2 | Timer interrupt | Asserted
422 | 03/03/2023 | 17:51:12 | Watchdog2 | Hard reset | Asserted
423 | 03/03/2023 | 17:51:12 | SysReset | State Asserted | Asserted
424 | 03/03/2023 | 18:20:22 | Platform Security #0x00 | Transition to Off Line | Asserted
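The SEL signature above (a Software NMI followed shortly by a Watchdog2 hard reset) can be checked for in a saved copy of the log. The following is a minimal sketch, not a NetApp tool: it assumes the SEL output has been saved as text in the pipe-delimited layout shown in the excerpt, and the names `SelEntry`, `parse_sel_line`, `find_watchdog_resets`, and the three-entry look-ahead window are hypothetical choices made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class SelEntry(NamedTuple):
    record_id: int
    date: str
    time: str
    sensor: str
    event: str
    state: str


def parse_sel_line(line: str) -> Optional[SelEntry]:
    """Parse one pipe-delimited SEL line; return None if it does not match."""
    parts = [p.strip() for p in line.split("|")]
    # Assumption: exactly six fields, the first being a numeric record ID.
    if len(parts) != 6 or not parts[0].isdigit():
        return None
    return SelEntry(int(parts[0]), *parts[1:])


def find_watchdog_resets(lines: List[str], window: int = 3) -> List[Tuple[SelEntry, SelEntry]]:
    """Pair each Software NMI with a Watchdog2 hard reset in the next `window` entries."""
    entries = [e for e in map(parse_sel_line, lines) if e is not None]
    hits = []
    for i, entry in enumerate(entries):
        if entry.sensor == "CriticalInt" and entry.event == "Software NMI":
            for later in entries[i + 1 : i + 1 + window]:
                if later.sensor == "Watchdog2" and later.event == "Hard reset":
                    hits.append((entry, later))
                    break
    return hits


# Sample data copied from the SEL excerpt in this article.
SAMPLE = """
420 | 03/03/2023 | 17:51:10 | CriticalInt | Software NMI | Asserted
421 | 03/03/2023 | 17:51:10 | Watchdog2 | Timer interrupt | Asserted
422 | 03/03/2023 | 17:51:12 | Watchdog2 | Hard reset | Asserted
423 | 03/03/2023 | 17:51:12 | SysReset | State Asserted | Asserted
424 | 03/03/2023 | 18:20:22 | Platform Security #0x00 | Transition to Off Line | Asserted
"""

hits = find_watchdog_resets(SAMPLE.splitlines())
for nmi, reset in hits:
    print(f"NMI record {nmi.record_id} at {nmi.time} -> hard reset record {reset.record_id} at {reset.time}")
```

A match only indicates that the SEL contains the same event sequence as this article; it does not confirm the root cause on its own, and the EMS messages listed earlier should be reviewed alongside it.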