Node down with 'Multiple fans failed' on BMC 11.4 or SP 5.7 and earlier
- Category: aff-series
- Specialty: HW
- Last Updated: 1/9/2025, 11:36:18 AM
Applies to
- ONTAP 9
- Baseboard Management Controller (BMC) firmware lower than 11.4
  - AFF C190, AFF A220, FAS2720, FAS2750
- Service Processor (SP) firmware lower than 5.7
  - AFF A300, AFF A200, FAS8200, FAS2650, FAS2620
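The firmware thresholds above can be checked with a small script. The following is a minimal sketch, not a NetApp tool: the function names are hypothetical, and it assumes the version string begins with `major.minor` (for example `11.3` or `5.6P1`). It follows the "lower than" wording of this list, so a version exactly equal to the threshold is treated as not matching.

```python
import re

# Thresholds taken from the "Applies to" list above.
THRESHOLDS = {"BMC": (11, 4), "SP": (5, 7)}


def parse_version(text):
    """Return (major, minor) from a string such as '11.3' or '5.6P1'.

    Assumption: the string starts with 'major.minor'; any suffix is ignored.
    """
    match = re.match(r"\s*(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized firmware version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def is_lower_than_threshold(kind, version):
    """True if the BMC or SP firmware version is lower than the listed threshold."""
    return parse_version(version) < THRESHOLDS[kind.upper()]
```

Versions are compared as integer pairs rather than as strings, so `5.10` is correctly treated as newer than `5.7`.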
Issue
- The controller shuts down after reporting multiple fan failures (both nodes in the HA pair may be impacted):
env_mgr: monitor.chassisFan.stop:error]: Chassis fan contains at least one stopped fan: SysFan3 F1 (1260 RPM)
env_mgr: monitor.chassisFan.stop:error]: Chassis fan contains at least one stopped fan: SysFan3 F2 (1260 RPM)
env_mgr: monitor.chassisFanFail.xMinShutdown:EMERGENCY]: Multiple Chassis Fan failure: System will shut down in 2 minutes.
monitor: monitor.globalStatus.critical:EMERGENCY]: Multiple fans has failed: Sysfan3 F1, Sysfan3 F2.
statd: monitor.fan.failed:alert]: Multiple fans has failed: Sysfan3 F1, Sysfan3 F2.
statd: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (Multiple fans failed)
- AutoSupport may trigger the following alerts:
HA Group Notification (CONTROLLER TAKEOVER COMPLETE HALT) NOTICE
HA Group Notification (Health Monitor process cphm: CriticalFruMultiFaultAlert[PSQ094195000111]) ALERT
- If the nodes fail to boot and a controller reseat is attempted, the nodes may remain down and unable to boot
- The following messages may appear in the console logs during boot:
Initializing System Memory ...
Loading Device Drivers ...
Configuring Devices ...
Waiting for SP ...
IPMI:Read midplane FRU 0 product info:timeout
IPMI:Read midplane FRU 0 product info:failed
Waiting for SP ...
IPMI:Get midplane FRU 1 inventory:timeout
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI:Get midplane FRU 1 inventory:failed
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
Configuring Devices ...
IPMI PCI Slot Control failed.
- If the boot succeeds, the node may report errors on other sensors and shut down again:
Mon May 24 10:07:52 GMT [nvram.hw.initWarn:WARNING]: NVRAM hardware initialization: Failed to get Battery FRU info.
May 24 10:11:19 [node-1:sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
May 24 10:13:19 [node-1:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the SP)
- sysconfig -M output for the reporting fan:
Fri Jun 18 2021, 23:48:28 MSK !FAN3!021819020448!441-00025!A1!
Sat Jun 19 2021, 01:04:18 MSK !FAN3: Error reading FRU EEPROM
Sat Jun 19 2021, 07:05:27 MSK !FAN3!021819020448!441-00025!A1!
- EMS: the issue remains after fan replacement:
env_mgr: monitor.chassisFan.stop:error]: Chassis fan contains at least one stopped fan: SysFan3 F1 (failed)
env_mgr: monitor.chassisFan.stop:error]: Chassis fan contains at least one stopped fan: SysFan3 F2 (failed)
env_mgr: monitor.chassisFanFail.xMinShutdown:EMERGENCY]: Multiple Chassis Fan failure: System will shut down in 2 minutes.
monitor: monitor.globalStatus.critical:EMERGENCY]: Multiple fans has failed: Sysfan3 F1, Sysfan3 F2.
statd: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (Multiple fans failed)
shutdown_thread0: ha.localNodeShutDown:notice]: Shutdown of the local node has been initiated with inhibit_takeover set to FALSE.
cf_main: cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of cluster2-01 disabled (local halt in progress).
shutdown_thread0: kern.shutdown:notice]: System shut down because : "Environmental Shutdown".
Wed Jun 23 20:46:54 MSK [toaster1: mgwd: callhome.hm.alert.critical:alert]: Call home for Health Monitor process cphm: CriticalFanFruFaultAlert[021604000131].
Thu Jun 24 11:00:00 MSK [toaster1: statd: monitor.fan.failed:alert]: Multiple fans has failed: Sysfan3 F1, Sysfan3 F2.
Thu Jun 24 11:18:36 MSK [toaster1: env_mgr: monitor.chassisFan.stop:error]: Chassis fan contains at least one stopped fan: SysFan3 F1 (failed)
Thu Jun 24 11:18:37 MSK [toaster1: env_mgr: monitor.chassisFan.stop:error]: Chassis fan contains at least one stopped fan: SysFan3 F2 (failed)
Thu Jun 24 11:19:06 MSK [toaster1: env_mgr: callhome.c.fan.fru.fault:error]: Call home for CHASSIS FAN FRU FAILED: SysFan3 F1
Thu Jun 24 11:19:06 MSK [toaster1: env_mgr: callhome.c.fan.fru.fault:error]: Call home for CHASSIS FAN FRU FAILED: SysFan3 F2
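When reviewing a large EMS log, it can help to count the events shown above and list the fans they name. The following is a minimal sketch under stated assumptions: the event names are copied from the log excerpts in this article, while the regular expressions and the `summarize` function are hypothetical and based only on the line layouts seen in these samples.

```python
import re
from collections import Counter

# Event names copied from the log excerpts in this article.
SIGNATURE_EVENTS = (
    "monitor.chassisFan.stop",
    "monitor.chassisFanFail.xMinShutdown",
    "monitor.fan.failed",
    "monitor.shutdown.emergency",
    "sp.ipmi.lost.shutdown",
)

# Assumption: each event appears as 'dotted.event.name:severity]'.
EVENT_RE = re.compile(r"([A-Za-z_]\w*(?:\.\w+)+):(\w+)\]")
# Assumption: stopped fans appear as 'stopped fan: <name> <id> (<state>)'.
STOPPED_FAN_RE = re.compile(r"stopped fan: (\S+ \S+) \(([^)]*)\)")


def summarize(lines):
    """Count signature events and record the last reported state of each fan."""
    events = Counter()
    fans = {}
    for line in lines:
        event = EVENT_RE.search(line)
        if event and event.group(1) in SIGNATURE_EVENTS:
            events[event.group(1)] += 1
        fan = STOPPED_FAN_RE.search(line)
        if fan:
            fans[fan.group(1)] = fan.group(2)
    return events, fans
```

A fan state of `failed` instead of an RPM value, as in the excerpt taken after the fan replacement, shows up directly in the returned `fans` mapping.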