CFBMC-609: Temperature-based policy implemented for multiple fan failures on HA systems with shared cooling
Issue
- Previously, the system environmental policy initiated a shutdown when two or
more system fans failed. On high-availability (HA) systems with shared
cooling, both nodes shut down at the same time, resulting in service
disruption.
SP logs:
[callhome.shlf.fan:EMERGENCY]: Call home for SHELF COOLING UNIT FAILED
[ses.status.fanError:EMERGENCY]: DS224-12 (S/N *****) shelf 0 on channel 0b cooling fan error for Cooling element 3: critical status; fan is off. This module is on the rear of the shelf on the lower right power supply.
[ses.status.fanError:EMERGENCY]: DS224-12 (S/N******) shelf 0 on channel 0b cooling fan error for Cooling element 4: critical status; fan is off. This module is on the rear of the shelf on the lower right power supply.
[monitor.globalStatus.critical:EMERGENCY]: Disk shelf fault.
[callhome.shlf.fan:EMERGENCY]: Call home for SHELF COOLING UNIT FAILED
[monitor.globalStatus.critical:EMERGENCY]: Disk shelf fault.
[callhome.shlf.fan:EMERGENCY]: Call home for SHELF COOLING UNIT FAILED
[monitor.fan.critical:EMERGENCY]: 2 fans have failed. Replace them to avoid overheating. If not corrected, system will shutdown in 2 minutes.
[monitor.globalStatus.critical:EMERGENCY]: Multiple fans has failed. Disk shelf fault.
[callhome.fans.failed:EMERGENCY]: Call home for MULTIPLE FAN FAILURE
[monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (Multiple fans failed)
- With the fix for this bug, the number of failed system fans no longer
triggers a shutdown by itself. The temperature-based environmental policy
continues to determine whether a system shutdown is needed, regardless of
the number of system fan failures.
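
The difference between the two behaviors can be illustrated with a minimal sketch. This is not the actual firmware logic: the names EnvReading, old_policy_should_shutdown, new_policy_should_shutdown and the CRITICAL_TEMP_C value are hypothetical and chosen only for illustration; real thresholds are platform-specific.

```python
from dataclasses import dataclass

# Hypothetical threshold for illustration only; real limits vary by platform.
CRITICAL_TEMP_C = 70.0


@dataclass
class EnvReading:
    failed_fans: int      # number of system fans reported as failed
    temperature_c: float  # hottest monitored sensor reading, in Celsius


def old_policy_should_shutdown(reading: EnvReading) -> bool:
    """Previous behavior: shut down once two or more fans have failed."""
    return reading.failed_fans >= 2


def new_policy_should_shutdown(reading: EnvReading) -> bool:
    """Fixed behavior: only temperature decides; the fan count is ignored."""
    return reading.temperature_c >= CRITICAL_TEMP_C


# Two fans failed, but the system is still within its temperature limit.
reading = EnvReading(failed_fans=2, temperature_c=45.0)
print(old_policy_should_shutdown(reading))  # True
print(new_policy_should_shutdown(reading))  # False
```

Under these assumptions, a dual fan failure on a shared-cooling HA pair no longer takes both nodes down as long as the measured temperature stays below the critical limit.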