System shutdown due to SP heartbeat stopped on AFF A250 or FAS500f (BMC 15.3 and earlier)
Applies to
- AFF A250
- FAS500f
- Baseboard Management Controller (BMC) 15.1P1, 15.2, and 15.3
Issue
- The node reboots because the BMC heartbeat stopped:
21:45:49 +0100 [node-01: spmgrd: sp.heartbeat.stopped:error]: Have not received a IPMI heartbeat from the Service Processor (SP) in last 600 seconds.
21:57:32 +0100 [node-01: spmgrd: sp.heartbeat.stopped:error]: Have not received a IPMI heartbeat from the Service Processor (SP) in last 600 seconds.
21:57:32 +0100 [node-01: spmgrd: callhome.sp.hbt.missed:notice]: Call home for SP HBT MISSED
22:09:09 +0100 [node-01: spmgrd: callhome.sp.hbt.stopped:alert]: Call home for SP HBT STOPPED
22:12:16 +0100 [node-01: env_mgr: sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 10 minutes.
22:22:16 +0100 [node-01: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)
- Because of the reboot, the partner node performs a takeover:
[Node-02: cf_main: cf.fsm.takeover.on.reboot:info]: Failover monitor: One node initiated automatic takeover after detecting that its partner node is rebooting.
- In some cases the affected node logs nothing during the event, and only the partner reports the takeover:
18:11:28 +0100 [node-A: cf_main: cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
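
The event sequence above can be checked for in a saved EMS log with a short script. The following is a minimal sketch, not a NetApp tool: it assumes each line has the `[node: process: event.name:severity]` shape seen in the excerpts above, and the names `extract_events` and `find_signature` are hypothetical helpers introduced for illustration. A `takeover_only` result is only a hint, because a partner takeover without a heartbeat can have other causes.

```python
import re

# Event names copied from the log excerpts in this article.
SHUTDOWN_EVENTS = {"sp.ipmi.lost.shutdown", "monitor.shutdown.emergency"}
TAKEOVER_EVENTS = {"cf.fsm.takeover.on.reboot", "cf.fsm.takeover.noHeartbeat"}

# Assumed line shape: "[node: process: event.name:severity]: message"
EVENT_RE = re.compile(
    r"\[(?P<node>[^:\]]+):\s*(?P<proc>[^:\]]+):\s*(?P<event>[^\]:]+):(?P<sev>[^\]]+)\]"
)


def extract_events(lines):
    """Return (node, event_name) pairs for every line that matches EVENT_RE."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events.append((match.group("node").strip(), match.group("event").strip()))
    return events


def find_signature(lines):
    """Classify a log excerpt against the symptoms described in this article.

    Returns "shutdown" if a node logs sp.heartbeat.stopped and later logs an
    emergency shutdown event, "takeover_only" if only partner takeover events
    are present, and None otherwise.
    """
    events = extract_events(lines)
    heartbeat_lost = set()
    for node, event in events:
        if event == "sp.heartbeat.stopped":
            heartbeat_lost.add(node)
        elif event in SHUTDOWN_EVENTS and node in heartbeat_lost:
            return "shutdown"
    if any(event in TAKEOVER_EVENTS for _, event in events):
        return "takeover_only"
    return None
```

Calling `find_signature` on the lines of an exported log file returns one of the three results; lines that do not match the assumed shape are ignored.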