CFBMC-3996: Node reboots due to SP HBT STOPPED on BMC 13.10P1
Issue
- An AFF A400, AFF C400, ASA A400, ASA C400, FAS8300, or FAS8700 node reboots unexpectedly because the Service Processor (SP) heartbeat was missed or stopped
- The following events are an example of this issue:
[Node-01: spmgrd: sp.heartbeat.stopped:error]: Have not received a IPMI heartbeat from the Service Processor (SP) in last 600 seconds.
[Node-01: spmgrd: callhome.sp.hbt.missed:notice]: Call home for SP HBT MISSED
[Node-01: spmgrd: callhome.sp.hbt.stopped:alert]: Call home for SP HBT STOPPED
[Node-01: env_mgr: sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 10 minutes.
[Node-01: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)
- IPMI_KCS_ERR messages are observed at the timestamp of the reboot in sktrace.log:
2024-03-10T01:30:58Z 2180899785867098 [5:0] IPMI_KCS_ERR: kcs_start_write: cmd 0x31 nf 0x36 state 3 not write
2024-03-10T01:30:58Z 2180899785870130 [5:0] IPMI_KCS_ERR: KCS cmd 0x31 nf 0x36: Failed to start write
2024-03-10T01:30:59Z 2180900784460092 [15:0] IPMI_KCS_ERR: kcs_error: cmd 0x31 nf 0x36 IBF not 0
2024-03-10T01:30:59Z 2180901778714878 [18:0] IPMI_KCS_ERR: kcs_error abort: cmd 0x31 nf 0x36 IBF not 0
2024-03-10T01:31:00Z 2180902760811516 [18:0] IPMI_KCS_ERR: kcs_error cmd 0x31 nf 0x36 not idle
2024-03-10T01:31:00Z 2180903779141166 [2:0] IPMI_KCS_ERR: kcs_error: cmd 0x31 nf 0x36 IBF not 0
- The node reboots and comes back online.
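To check a copy of sktrace.log for the IPMI_KCS_ERR pattern described above, a short script along the following lines may help. It is a minimal sketch, not a NetApp tool: the regular expression is inferred only from the sample lines in this article, and the names KCS_LINE and find_kcs_errors are hypothetical.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the sample sktrace.log lines in this article;
# this is an assumption, not an official format definition.
KCS_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\s+\d+\s+\[\d+:\d+\]\s+"
    r"IPMI_KCS_ERR:\s+(?P<msg>.*)$"
)

def find_kcs_errors(lines):
    """Return (UTC timestamp, message) tuples for each IPMI_KCS_ERR line."""
    hits = []
    for line in lines:
        match = KCS_LINE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
            hits.append((ts.replace(tzinfo=timezone.utc), match.group("msg")))
    return hits

# Two lines copied from the example above, plus one made-up unrelated line.
sample = [
    "2024-03-10T01:30:58Z 2180899785867098 [5:0] IPMI_KCS_ERR: kcs_start_write: cmd 0x31 nf 0x36 state 3 not write",
    "2024-03-10T01:30:59Z 2180900784460092 [15:0] IPMI_KCS_ERR: kcs_error: cmd 0x31 nf 0x36 IBF not 0",
    "2024-03-10T01:31:00Z 2180902760811516 [18:0] OTHER_TAG: unrelated entry",
]

errors = find_kcs_errors(sample)
for ts, msg in errors:
    print(ts.isoformat(), msg)
```

To scan a real file, pass an open file handle instead of the sample list, then compare the returned timestamps with the time of the sp.ipmi.lost.shutdown event.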