BMC frequently reboots and multiple sensor errors are reported
Applies to
- FAS2750
- FAS2720
- AFF A220
- FAS2650
- FAS2620
- BMC firmware 11.6
- IOM12E firmware 2.20 or earlier
Issue
- EMS error alert:
Sun May 09 13:29:30 CEST [node_name: env_mgr: callhome.c.fan.fru.fault:error]: Call home for CHASSIS FAN FRU FAILED: Multiple fans have failed
- BMC event messages:
Record 1746: Sun May 09 11:42:16.460000 2021 [BMC.critical]: Rebooting SP due to loss of ACP comms
Record 1747: Sun May 09 11:42:17.570000 2021 [ASUP.notice]: First notification email | (INVALID CHASSIS CONFIGURATION (Incompatible Partner PCM)) CRITICAL | Send failed
Record 1748: Sun Jan 01 00:00:22.270000 2017 [IPMI.notice]: 0019 | c0 | OEM: ffff70005100 | ManufId: 150300 | BMC Reset Internally
- Multiple EMS errors are reported for different components; some are reported as cleared again shortly afterward, in some cases within seconds. Example:
Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 4 Temp) is not readable.
Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 1 Temp) is not readable.
Sun May 09 12:27:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Chassis temperature is too high..
Sun May 09 12:27:10 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 12:28:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 11: not installed or failed. Current temperature: 41 C (105 F). This module is on the rear of the shelf at the top left, on shelf module A.
Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 12: not installed or failed. Current temperature: 24 C (75 F). This module is on the rear of the shelf at the top left, on shelf module A.
Sun May 09 13:29:00 CEST [node_name: env_mgr: monitor.fan.warning:notice]: multiple fans have failed. Replace it to avoid overheating
Sun May 09 13:30:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Multiple fans has failed. Chassis temperature is too high..
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 11: normal status.
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 12: normal status.
Sun May 09 13:33:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU1 is not readable.
Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU2 is not readable.
Sun May 09 14:00:00 CEST [node_name: statd: monitor.fan.failed:alert]: Multiple fans has failed.
Sun May 09 14:01:55 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU1 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU2 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fan.ok:notice]: All fans are OK.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 14:02:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module B Expander Temp) is not readable.
Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module A Expander Temp) is not readable.
- The node may panic because of the reported multiple fan failure.
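
The fault-then-clear pattern in the EMS example above can be checked across a larger log with a small script. The following is a minimal sketch, not a NetApp tool: the function names and the `CLEARED_BY` pairing table are hypothetical and derived only from the event names in the example, and because EMS lines carry no year, 2021 is assumed from the BMC records.

```python
import re
from datetime import datetime

# Month names are mapped by hand so parsing does not depend on the system locale.
MONTHS = {name: num for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Layout taken from the EMS example lines:
# "<Day> <Mon> <DD> <HH:MM:SS> <TZ> [<node>: <process>: <event>:<severity>]: <message>"
EMS_LINE = re.compile(
    r"^\w{3} (?P<mon>\w{3}) (?P<day>\d{2}) "
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}) \S+ "
    r"\[[^:]+: [^:]+: (?P<event>[\w.]+):(?P<sev>\w+)\]: (?P<msg>.*)$"
)

# Assumption: which event clears which fault, inferred from the example only.
CLEARED_BY = {
    "monitor.globalStatus.critical": "monitor.globalStatus.ok",
    "monitor.fru.info.unreadable": "monitor.fru.info.readable",
}


def parse_ems(line, year=2021):
    """Return (timestamp, event, severity, message), or None if the line does not match."""
    match = EMS_LINE.match(line.strip())
    if match is None:
        return None
    month = MONTHS.get(match["mon"])
    if month is None:
        return None
    when = datetime(year, month, int(match["day"]),
                    int(match["h"]), int(match["m"]), int(match["s"]))
    return when, match["event"], match["sev"], match["msg"]


def transient_faults(lines, year=2021):
    """List (fault event, first seen, time until cleared) for faults that later cleared."""
    open_faults = {}  # clearing event name -> (fault event name, first seen)
    closed = []
    for line in lines:
        parsed = parse_ems(line, year)
        if parsed is None:
            continue
        when, event, _sev, _msg = parsed
        if event in CLEARED_BY:
            # Keep the first occurrence so repeated faults do not reset the start time.
            open_faults.setdefault(CLEARED_BY[event], (event, when))
        elif event in open_faults:
            fault, started = open_faults.pop(event)
            closed.append((fault, started, when - started))
    return closed


# Four lines copied from the EMS example.
SAMPLE = [
    "Sun May 09 12:27:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Chassis temperature is too high..",
    "Sun May 09 12:28:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.",
    "Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU1 is not readable.",
    "Sun May 09 14:01:55 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU1 is readable.",
]

if __name__ == "__main__":
    for fault, started, duration in transient_faults(SAMPLE):
        print(f"{started} {fault}: cleared after {duration}")
```

Faults are tracked per event name rather than per component, so two PSUs that become unreadable at the same time are counted as one fault. That is enough to show whether errors clear on their own, but not to attribute them to a specific FRU.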