AFF A700s node reports SP HBT STOPPED and Emergency shutdown to recover the BMC
Applies to
- AFF A700s
- BMC (baseboard management controller)
Issue
- After an ONTAP upgrade, nodes attempt to auto-update the BMC firmware
- Auto-updates fail on one or more nodes:
[cluster-01: servprocd: sp.servprocd.upd.evts:debug]: params: {'reason': 'BMC update - Pre-update checks passed.'}
[cluster-01: servprocd: sp.servprocd.upd.evts:debug]: params: {'reason': 'SP Firmware network update from 1.89 to 1.91 has been triggered.'}
[cluster-01: servprocd: sp.servprocd.upd.unexpt.evts:debug]: params: {'reason': 'BMC update - Update failed after timeout.'}
[cluster-01: servprocd: sp.servprocd.upd.error:error]: SP update error: SP firmware update failure has been detected.
[cluster-01: servprocd: sp.servprocd.upd.unexpt.evts:debug]: params: {'reason': 'BMC update pre-update checks failed.'}
[cluster-01: servprocd: sp.servprocd.upd.error:error]: SP update error: SP firmware update failure has been detected.
- This results in AutoSupport notifications for missed and stopped SP heartbeats:
[cluster-01: env_mgr: callhome.sp.hbt.missed:notice]: Call home for SP HBT MISSED
[cluster-01: env_mgr: callhome.sp.hbt.stopped:alert]: Call home for SP HBT STOPPED
- After several days in this condition, the nodes halt and cannot be reached remotely through the BMC:
[cluster-01: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)
- A console connection to the node shows no BMC logs (system log console, system log console bak, and system log sel are all empty or contain only a single entry)
- Attempting to boot the node from LOADER results in:
***************************************************
This platform is not supported in this release.
The system will now halt
***************************************************
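
The log signatures listed above can be checked for mechanically. As a minimal sketch, assuming the EMS messages have been exported to plain text in the bracketed format shown in this article, the following Python snippet groups the failure-related events by node. The function name find_symptoms and the choice of which events to flag are illustrative only and are not part of ONTAP or any NetApp tool.

```python
import re
from collections import defaultdict

# Event names copied from the log excerpts in this article.
EVENTS_OF_INTEREST = {
    "sp.servprocd.upd.error",
    "callhome.sp.hbt.missed",
    "callhome.sp.hbt.stopped",
    "monitor.shutdown.emergency",
}

# Matches lines shaped like: [node: source: event.name:severity]: message
LINE_RE = re.compile(
    r"\[(?P<node>[^:\]]+):\s*(?P<source>[^:\]]+):\s*"
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]:\s*(?P<message>.*)"
)


def find_symptoms(lines):
    """Return {node: [event, ...]} for the flagged events, in the order seen."""
    hits = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("event") in EVENTS_OF_INTEREST:
            hits[match.group("node")].append(match.group("event"))
    return dict(hits)


# Sample input taken from the excerpts above.
sample = [
    "[cluster-01: servprocd: sp.servprocd.upd.evts:debug]: params: "
    "{'reason': 'BMC update - Pre-update checks passed.'}",
    "[cluster-01: servprocd: sp.servprocd.upd.error:error]: SP update error: "
    "SP firmware update failure has been detected.",
    "[cluster-01: env_mgr: callhome.sp.hbt.missed:notice]: Call home for SP HBT MISSED",
    "[cluster-01: env_mgr: callhome.sp.hbt.stopped:alert]: Call home for SP HBT STOPPED",
    "[cluster-01: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: "
    "Environmental Reason Shutdown (System reboot to recover the BMC)",
]

for node, events in find_symptoms(sample).items():
    print(node, events)
```

A node that shows the update error followed by the heartbeat and emergency shutdown events matches the sequence described in this article. The debug-level update events are skipped here because they also appear during successful updates.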