CFBMC-3251: Many IO cards are reported as degraded and recovered by BMC reboot
Issue
Many IO cards are reported as degraded, and the condition is recovered by a BMC reboot.
- Multiple sensors degrade simultaneously, and each is reported with the status "is not readable".
Wed Jul 10 19:06:17 +0900 [node-1: env_mgr: monitor.ioCard.degraded:alert]: IO card is degraded: IO1 SAS Inflow Temp is not readable
Wed Jul 10 19:06:20 +0900 [node-1: env_mgr: monitor.ioCard.degraded:alert]: IO card is degraded: IO1 SAS Outflow Temp is not readable
・・・
Wed Jul 10 19:06:33 +0900 [node-1: env_mgr: monitor.ioCard.degraded:alert]: IO card is degraded: IO11 SAS P12V HS is not readable
Wed Jul 10 19:06:33 +0900 [node-1: env_mgr: monitor.ioCard.degraded:alert]: IO card is degraded: IO11 SAS Hot Swap Cur is not readable
- An SP reboot is then triggered immediately, and the message "Chassis temperature is too high" is displayed with the status "monitor.globalStatus.critical: EMERGENCY".
Wed Jul 10 19:06:33 +0900 [node-1: env_mgr: sp.reboot.sensor.unreadable:notice]: Rebooting BMC because one or more sensors are unreadable.
Wed Jul 10 19:07:00 +0900 [node-1: monitor: monitor.globalStatus.critical:EMERGENCY]: Chassis temperature is too high..
Wed Jul 10 19:07:48 +0900 [node-1: cf_worker: cf.hwassist.notifyCfgSuccess:debug]: params: {'hwtype': 'BMC'}
- However, an ASUP for "hm.alert.critical: alert" is still triggered.
Wed Jul 10 19:18:45 +0900 [node-1: mgwd: callhome.hm.alert.critical:alert]: Call home for Health Monitor process cphm: CriticalFruMultiFaultAlert[033243222222].
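To check whether a saved EMS log matches this signature, the log lines can be grouped by IO card. The following Python is a minimal sketch, not a supported tool. It assumes each entry sits on its own line in the "timestamp [node: process: event:severity]: message" layout shown above, and that the first token after "IO card is degraded:" is the card name and the rest is the sensor name. The function names parse_ems_line and unreadable_sensors_by_card are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpts above:
#   <timestamp> [<node>: <process>: <event>:<severity>]: <message>
EMS_LINE = re.compile(
    r"^(?P<timestamp>.+?) \[(?P<node>[^:]+): (?P<process>[^:]+): "
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]: (?P<message>.*)$"
)

# Assumption: "<card> <sensor name> is not readable", e.g. "IO1 SAS Inflow Temp".
UNREADABLE = re.compile(
    r"^IO card is degraded: (?P<card>IO\d+) (?P<sensor>.+) is not readable$"
)


def parse_ems_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = EMS_LINE.match(line.strip())
    return match.groupdict() if match else None


def unreadable_sensors_by_card(lines):
    """Map each IO card to the sensors reported as 'is not readable'."""
    cards = defaultdict(list)
    for line in lines:
        entry = parse_ems_line(line)
        if entry is None or entry["event"] != "monitor.ioCard.degraded":
            continue
        match = UNREADABLE.match(entry["message"])
        if match:
            cards[match.group("card")].append(match.group("sensor"))
    return dict(cards)


# Sample lines copied from the log excerpts in this article.
sample = [
    "Wed Jul 10 19:06:17 +0900 [node-1: env_mgr: monitor.ioCard.degraded:alert]: "
    "IO card is degraded: IO1 SAS Inflow Temp is not readable",
    "Wed Jul 10 19:06:20 +0900 [node-1: env_mgr: monitor.ioCard.degraded:alert]: "
    "IO card is degraded: IO1 SAS Outflow Temp is not readable",
    "Wed Jul 10 19:06:33 +0900 [node-1: env_mgr: monitor.ioCard.degraded:alert]: "
    "IO card is degraded: IO11 SAS P12V HS is not readable",
    "Wed Jul 10 19:06:33 +0900 [node-1: env_mgr: monitor.ioCard.degraded:alert]: "
    "IO card is degraded: IO11 SAS Hot Swap Cur is not readable",
    "Wed Jul 10 19:06:33 +0900 [node-1: env_mgr: sp.reboot.sensor.unreadable:notice]: "
    "Rebooting BMC because one or more sensors are unreadable.",
]

result = unreadable_sensors_by_card(sample)
for card, sensors in result.items():
    print(card, sensors)
```

Several cards each listing unreadable sensors within a few seconds, followed by an sp.reboot.sensor.unreadable event, is consistent with the sequence described in this issue.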