CFBMC-6022: Takeover due to BMC SoC hang causes ONTAP shutdown with BMC FW 13.11
Issue
Platforms running:
- BMC FW 13.11
- BMC FW 13.11P1
The partner node performs a takeover after the SP heartbeat is missed and then stops:
- ONTAP has not received an IPMI heartbeat from the Service Processor (SP) in the last 600 seconds
- AutoSupport (ASUP) has reported that the SP heartbeat has stopped
- The system has been rebooted to recover the BMC
EMS log example:
ERROR  asup.post.drop: AutoSupport message (HA Group Notification from node-01 (SP HBT MISSED) NOTICE) was not posted to NetApp. The system will drop the message.
ERROR  mgmtgwd.vreport.nodesUnreachable: Vreport encountered some unreachable nodes. The report may be incomplete.
ALERT  callhome.sfo.takeover: Call home for CONTROLLER TAKEOVER COMPLETE AUTOMATIC
ERROR  cf.fsm.takeoverOfPartnerDisabled: Failover monitor: takeover of node-02 disabled (local halt in progress).
EMERGENCY   monitor.shutdown.emergency: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)
EMERGENCY   sp.ipmi.lost.shutdown: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 10 minutes.
ALERT  callhome.sp.hbt.stopped: Call home for SP HBT STOPPED
ALERT  callhome.sp.hbt.missed: Call home for SP HBT MISSED
ERROR  sp.heartbeat.stopped: Have not received a IPMI heartbeat from the Service Processor (SP) in last 600 seconds.
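
To check a saved EMS log for this pattern, the event names in the example above can be matched with a short script. The following is a minimal sketch, not a NetApp tool: it assumes each line has the layout shown above (severity, event name, colon, message), and the function name find_signature_events and the choice of events in SIGNATURE_EVENTS are illustrative assumptions. Logs exported in a different layout would need a different pattern.

```python
import re

# Event names copied from the EMS example in this article.
# Which of them to treat as the "signature" is an assumption of this sketch.
SIGNATURE_EVENTS = {
    "sp.heartbeat.stopped",
    "sp.ipmi.lost.shutdown",
    "callhome.sp.hbt.missed",
    "callhome.sp.hbt.stopped",
    "monitor.shutdown.emergency",
    "callhome.sfo.takeover",
}

# Assumed line layout: "<SEVERITY>  <event.name>: <message>"
LINE_RE = re.compile(
    r"^\s*(?P<severity>[A-Z]+)\s+(?P<event>[A-Za-z0-9_.]+):\s*(?P<message>.*)$"
)


def find_signature_events(lines):
    """Return (severity, event, message) tuples for lines whose event name
    is in SIGNATURE_EVENTS. Lines that do not match the layout are skipped."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("event") in SIGNATURE_EVENTS:
            hits.append(
                (match.group("severity"), match.group("event"), match.group("message"))
            )
    return hits


if __name__ == "__main__":
    sample = [
        "ALERT  callhome.sp.hbt.stopped: Call home for SP HBT STOPPED",
        "ERROR  mgmtgwd.vreport.nodesUnreachable: Vreport encountered some unreachable nodes.",
    ]
    for severity, event, message in find_signature_events(sample):
        print(f"{severity:<10} {event}: {message}")
```

A match on these events alone does not confirm this issue; the BMC firmware version (13.11 or 13.11P1) still has to be verified on the affected node.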