AFF-A250 unexpected node reboot (BMC 15.6, 15.7)
Applies to
- AFF-A250
- Baseboard Management Controller (BMC) firmware version 15.7 or earlier
Issue
- The node halts unexpectedly. Example messages:
[node_name: spmgrd: sp.heartbeat.stopped:error]: Have not received a IPMI heartbeat from the Service Processor (SP) in last 600 seconds.
[node_name: spmgrd: callhome.sp.hbt.missed:notice]: Call home for SP HBT MISSED
[node_name: spmgrd: callhome.sp.hbt.stopped:alert]: Call home for SP HBT STOPPED
[node_name: env_mgr: sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 10 minutes.
[node_name: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)
[node_name: mgwd: mgwd.notify.halt.result:info]: MGWD able to notify CLAM on its HA partner node that this node is undergoing a planned shutdown (reason: E). Error: -
- The BMC system event log (SEL) shows an IPMI cold reset following multiple bus correctable errors:
BMC node_name> system log sel
3e1 | 03/08/2023 | 16:09:46 | Critical Interrupt #0x31 | Bus Correctable error | Asserted
3e2 | 03/08/2023 | 16:09:46 | Critical Interrupt #0x31 | Bus Correctable error | Asserted
...
3f1 | OEM record f2 | IPMI cold reset
3f2 | OEM record f2 | Pilot Software reset
- Or the BMC event log shows the BMC being reset by the FPGA:
1c9 | OEM record f2 | FPGA pull BMC whole reset
1ca | OEM record f2 | Pilot AC cycle
- The node's BMC may not be accessible, even through the serial console port.
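
When comparing a collected log against the symptoms above, a small script can help spot the relevant lines. The following is a minimal sketch, not a supported tool: the function name `match_signatures` and the two tuples are hypothetical, and the search strings are copied only from the example messages and BMC event log entries shown in this article. It assumes the console messages and the `system log sel` output have already been saved as plain text.

```python
# Event names copied from the example messages in the Issue section.
EMS_EVENTS = (
    "sp.heartbeat.stopped",
    "callhome.sp.hbt.missed",
    "callhome.sp.hbt.stopped",
    "sp.ipmi.lost.shutdown",
    "monitor.shutdown.emergency",
)

# Strings copied from the example BMC event log entries in the Issue section.
SEL_SIGNATURES = (
    "Bus Correctable error",
    "IPMI cold reset",
    "Pilot Software reset",
    "FPGA pull BMC whole reset",
    "Pilot AC cycle",
)


def match_signatures(text):
    """Return the event names and SEL strings from this article found in text.

    The match is a plain substring search; it does not parse timestamps
    or record IDs and does not prove that the issue is present.
    """
    lowered = text.lower()
    # Event names are followed by ":<severity>" in the example messages.
    ems = [name for name in EMS_EVENTS if name + ":" in text]
    sel = [sig for sig in SEL_SIGNATURES if sig.lower() in lowered]
    return {"ems": ems, "sel": sel}


if __name__ == "__main__":
    sample = (
        "3e1 | 03/08/2023 | 16:09:46 | Critical Interrupt #0x31 | "
        "Bus Correctable error | Asserted\n"
        "3f1 | OEM record f2 | IPMI cold reset\n"
    )
    print(match_signatures(sample))
```

A result that lists both heartbeat-related events and one of the reset entries suggests the log is worth comparing line by line against the examples above; an empty result only means none of these exact strings appeared in the supplied text.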