NetApp Knowledge Base

BMC reboots frequently and reports multiple sensor errors

Category: fas-systems
Specialty: hw
Applies to

  • FAS2750
  • FAS2720
  • AFF A220
  • FAS2650
  • FAS2620
  • BMC firmware 11.6
  • IOM12E firmware 2.20 or lower
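When checking several systems against the list above, the criteria can be written down as a small check. This is a minimal Python sketch, not a NetApp tool. It assumes that a system is in scope only when its platform is listed, its BMC firmware is 11.6, and its IOM12E firmware is 2.20 or lower (the article does not state whether both firmware conditions must hold together), and that you supply the versions yourself as dotted strings whose fields are integers. The function names `parse_version` and `matches_applies_to` are hypothetical.

```python
from typing import Tuple

# Criteria copied from the "Applies to" list in this article.
AFFECTED_PLATFORMS = {"FAS2750", "FAS2720", "AFF A220", "FAS2650", "FAS2620"}
AFFECTED_BMC_FW = (11, 6)          # BMC firmware 11.6
MAX_AFFECTED_IOM12E_FW = (2, 20)   # IOM12E firmware 2.20 or lower


def parse_version(text: str) -> Tuple[int, ...]:
    """Convert a dotted version string such as '2.20' into (2, 20)."""
    return tuple(int(part) for part in text.strip().split("."))


def matches_applies_to(platform: str, bmc_fw: str, iom12e_fw: str) -> bool:
    """Return True only if all three criteria hold (an assumption, see lead-in)."""
    if platform.strip().upper() not in AFFECTED_PLATFORMS:
        return False
    # Compare only major.minor, so a hypothetical '11.6.1' still counts as 11.6.
    if parse_version(bmc_fw)[:2] != AFFECTED_BMC_FW:
        return False
    # Tuples compare field by field, so (2, 10) < (2, 20) < (2, 30).
    return parse_version(iom12e_fw) <= MAX_AFFECTED_IOM12E_FW


if __name__ == "__main__":
    print(matches_applies_to("FAS2750", "11.6", "2.20"))
    print(matches_applies_to("FAS2750", "11.6", "2.30"))
```

Converting to integer tuples avoids plain string comparison, which would sort "2.3" after "2.20"; under this sketch's assumption, "2.3" is read as (2, 3) and therefore as lower than 2.20.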

Issue

  • EMS error alert:

Sun May 09 13:29:30 CEST [node_name: env_mgr: callhome.c.fan.fru.fault:error]: Call home for CHASSIS FAN FRU FAILED: Multiple fans have failed

  • BMC event messages:

Record 1746: Sun May 09 11:42:16.460000 2021 [BMC.critical]: Rebooting SP due to loss of ACP comms
Record 1747: Sun May 09 11:42:17.570000 2021 [ASUP.notice]: First notification email | (INVALID CHASSIS CONFIGURATION (Incompatible Partner PCM)) CRITICAL | Send failed
Record 1748: Sun Jan 01 00:00:22.270000 2017 [IPMI.notice]: 0019 | c0 | OEM: ffff70005100 | ManufId: 150300 | BMC Reset Internally

  • Multiple EMS errors are reported for different components, and some clear on their own after a few seconds. Examples:

Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 4 Temp) is not readable.

Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 1 Temp) is not readable.
Sun May 09 12:27:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Chassis temperature is too high..
Sun May 09 12:27:10 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 12:28:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.

Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 11: not installed or failed. Current temperature: 41 C (105 F). This module is on the rear of the shelf at the top left, on shelf module A.
Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 12: not installed or failed. Current temperature: 24 C (75 F). This module is on the rear of the shelf at the top left, on shelf module A.

Sun May 09 13:29:00 CEST [node_name: env_mgr: monitor.fan.warning:notice]: multiple fans have failed. Replace it to avoid overheating
Sun May 09 13:30:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Multiple fans has failed. Chassis temperature is too high..
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 11: normal status.
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 12: normal status.
Sun May 09 13:33:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.

Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU1 is not readable.
Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU2 is not readable.
Sun May 09 14:00:00 CEST [node_name: statd: monitor.fan.failed:alert]: Multiple fans has failed.
Sun May 09 14:01:55 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU1 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU2 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fan.ok:notice]: All fans are OK.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 14:02:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.

Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module B Expander Temp) is not readable.
Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module A Expander Temp) is not readable.

  • The node may panic because of the multiple fan failure.
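Several of the EMS errors above clear on their own within seconds or minutes, for example `monitor.globalStatus.critical` at 12:27:00 followed by `monitor.globalStatus.ok` at 12:28:00. When reviewing a longer EMS log, pairing each fault with its clearing event makes this pattern easier to spot. The following is a minimal Python sketch, not a NetApp tool. The fault-to-clear mapping in `CLEARED_BY` is hypothetical and inferred only from the sample messages in this article. Because EMS lines omit the year, the year (2021, taken from the BMC records) is an assumption that the caller supplies. The function names are also hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Matches EMS lines such as:
# Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: ...
EMS_LINE = re.compile(
    r"^(?P<ts>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"\[(?P<node>[^:\]]+): (?P<proc>[^:\]]+): (?P<event>[\w.]+):(?P<sev>\w+)\]: "
    r"(?P<msg>.*)$"
)

# Hypothetical fault -> clearing-event mapping, inferred only from the sample
# messages in this article; it is not an official NetApp list.
CLEARED_BY = {
    "monitor.globalStatus.critical": "monitor.globalStatus.ok",
    "monitor.fru.info.unreadable": "monitor.fru.info.readable",
    "ses.status.temperatureWarning": "ses.status.temperatureInfo",
    "monitor.fan.warning": "monitor.fan.ok",
    "monitor.fan.failed": "monitor.fan.ok",
}

Pairing = Tuple[str, datetime, Optional[datetime]]


def parse_ems(line: str, year: int = 2021) -> Optional[Tuple[datetime, str]]:
    """Return (timestamp, event name) for an EMS line, or None if it does not match."""
    m = EMS_LINE.match(line.strip())
    if m is None:
        return None
    # The time zone (e.g. CEST) is captured but ignored; the year is assumed.
    ts = datetime.strptime(f"{m['ts']} {year}", "%a %b %d %H:%M:%S %Y")
    return ts, m["event"]


def transient_faults(lines: List[str], year: int = 2021) -> List[Pairing]:
    """Pair each mapped fault with the first later event that clears it."""
    open_faults: Dict[str, List[datetime]] = defaultdict(list)
    results: List[Pairing] = []
    for line in lines:
        parsed = parse_ems(line, year)
        if parsed is None:
            continue
        ts, event = parsed
        if event in CLEARED_BY:
            open_faults[event].append(ts)
            continue
        # A clearing event closes every open fault that maps to it.
        for fault, clear in CLEARED_BY.items():
            if clear == event:
                for raised in open_faults.pop(fault, []):
                    results.append((fault, raised, ts))
    # Faults still open never saw their clearing event in the input.
    for fault, stamps in open_faults.items():
        for raised in stamps:
            results.append((fault, raised, None))
    return results


if __name__ == "__main__":
    sample = [
        "Sun May 09 12:27:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Chassis temperature is too high..",
        "Sun May 09 12:28:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.",
        "Sun May 09 13:29:00 CEST [node_name: env_mgr: monitor.fan.warning:notice]: multiple fans have failed. Replace it to avoid overheating",
        "Sun May 09 14:00:00 CEST [node_name: statd: monitor.fan.failed:alert]: Multiple fans has failed.",
        "Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fan.ok:notice]: All fans are OK.",
    ]
    for fault, raised, cleared in transient_faults(sample):
        status = f"cleared after {(cleared - raised).total_seconds():.0f} s" if cleared else "still open"
        print(f"{raised:%H:%M:%S} {fault}: {status}")
```

A clearing event closes every open fault mapped to it because messages such as `monitor.fan.ok` ("All fans are OK.") describe a whole subsystem rather than one fan. As a trade-off, per-item events such as the PSU1 and PSU2 messages are paired with the first matching clear rather than with their own.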


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.