Both nodes in HA pair reboot due to power loss
Applies to
- FAS Systems
- AFF Systems
Issue
- Both nodes in an HA pair reboot at the same time.
- EMS log example (repeated on both nodes at the same time) showing DC undervoltage and AC Fail on both PSUs:
[node_name: dsa_worker3: ses.status.psWarning:error]: DS224-12 (S/N 012345678910) shelf 0 on channel 0b power warning for Power supply 1: warning status; DC undervoltage. This module is on the rear of the shelf at the bottom left.
[node_name: dsa_worker4: ses.status.psError:alert]: DS224-12 (S/N 012345678910) shelf 0 on channel 0b power error for Power supply 1: critical status; AC Fail. This module is on the rear of the shelf at the bottom left.
[node_name: dsa_worker4: callhome.shlf.power.intr:error]: Call home for SHELF POWER INTERRUPTED
[node_name: statd: monitor.shelf.fault:alert]: Critical fault reported on disk storage shelf attached to channel 0b. Check fans, power supplies, disks, and temperature sensors.
[node_name: power_low_monitor: monitor.chassisPower.degraded:alert]: Chassis power is degraded: Power Supply Status Critical: PSU1.
[node_name: power_low_monitor: callhome.chassis.power:error]: Call home for CHASSIS POWER DEGRADED: Power Supply Status Critical: PSU1.
[node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Power Supply Status Critical: PSU1. Disk shelf fault.
[node_name: dsa_worker2: ses.status.psInfo:info]: DS224-12 (S/N 9872957495809) shelf 0 on channel 0b power supply information for Power supply 1: normal status.
[node_name: dsa_worker0: ses.status.psWarning:error]: DS224-12 (S/N 012345678910) shelf 0 on channel 0b power warning for Power supply 2: warning status; DC undervoltage. This module is on the rear of the shelf at the bottom right.
[node_name: dsa_worker2: callhome.shlf.ps.fault:error]: Call home for SHELF POWER SUPPLY WARNING
- BMC/SP events report power loss (repeated on both nodes at the same time):
Record 2435: Mon Dec 05 22:33:43.000000 2022 [BMC.emergency]: System input power lost
Record 2436: Sun Jan 01 00:00:22.310000 2017 [IPMI.notice]: 05f2 | c0 | OEM: ffff7000ff00 | ManufId: 150300 | BMC Power Reset
Record 2437: Sun Jan 01 00:00:22.330000 2017 [IPMI.notice]: 05f3 | c0 | OEM: fcff70560000 | ManufId: 150300 | POS Register: Power on Reset(Normal Power Cycle)
OR
Record 1596: Sat Sep 11 08:03:16 2021 [SP.emergency]: System input power lost
Record 1597: Thu Jan 1 00:00:32 1970 [IPMI.notice]: ce01 | c0 | OEM: ffff7000ff00 | ManufId: 150300 | SP Power Reset
Record 1598: Thu Jan 1 00:00:32 1970 [IPMI.notice]: cf01 | c0 | OEM: fcff70560000 | ManufId: 150300 | POS Register: Power on Reset(Normal Power Cycle)
- BMC/SP system log reports power issues (repeated on both nodes at the same time). Example:
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: Power Action:needed(0) action(0); Alert Action: needed(1) action(17)
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: EventFilter: event on sensor(#0x32 dir:3) match (15) ALERT
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: Power Action:needed(0) action(0); Alert Action: needed(1) action(17)
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: EventFilter: event on sensor(#0x34 dir:3) match (15) ALERT
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: Power Action:needed(0) action(0); Alert Action: needed(1) action(17)
BMC hsam[1426]: FRU /chassis-1 LED on
BMC hsam[1426]: FRU /chassis-1/controller-b/cna-3 LED on
BMC hsam[1426]: HSAM OS(bmc):cmd(set) FLD(cna-4):fault(Overcurrent Protection Fault)
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: EventFilter: event on sensor(#0x5b dir:3) match (15) ALERT
BMC hsam[1426]: FRU /chassis-1 LED on
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: Power Action:needed(0) action(0); Alert Action: needed(1) action(17)
BMC hsam[1426]: FRU /chassis-1/controller-b/cna-4 LED on
BMC hsam[1426]: HSAM OS(bmc):cmd(set) FLD(cna-1):fault(Overcurrent Protection Fault)
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: EventFilter: event on sensor(#0x5d dir:3) match (15) ALERT
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: Power Action:needed(0) action(0); Alert Action: needed(1) action(17)
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: EventFilter: event on sensor(#0x5e dir:3) match (15) ALERT
BMC IPMIMain[1142]: [1142 : 1167 INFO]PEF.c: Power Action:needed(0) action(0); Alert Action: needed(1) action(17)
- The issue persists after the PSUs and/or controllers are reseated or replaced.
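
The symptom that matters most here is that both nodes record "System input power lost" at the same moment. As a rough aid, the following Python sketch scans the BMC/SP event text saved from each node and pairs up power-loss records whose timestamps fall close together. It is only an illustration: the function names, the 60-second tolerance, and the assumption that the events were saved as plain text in the record format shown above are not part of any NetApp tooling. Records logged after the power loss are ignored on purpose, because their timestamps (1970 or 2017 in the examples) reflect a reset clock.

```python
import re
from datetime import datetime, timedelta

# Matches records in the format shown in this article, for example:
#   Record 2435: Mon Dec 05 22:33:43.000000 2022 [BMC.emergency]: System input power lost
#   Record 1596: Sat Sep 11 08:03:16 2021 [SP.emergency]: System input power lost
# Assumption: month names are in English and the layout matches these samples.
POWER_LOST_RE = re.compile(
    r"^Record \d+: \w{3} (\w{3}) +(\d{1,2}) (\d{2}:\d{2}:\d{2})(?:\.\d+)? (\d{4}) "
    r"\[(?:BMC|SP)\.emergency\]: System input power lost"
)


def power_loss_times(log_text):
    """Return the timestamps of all 'System input power lost' records."""
    times = []
    for line in log_text.splitlines():
        match = POWER_LOST_RE.match(line.strip())
        if match:
            month, day, clock, year = match.groups()
            times.append(
                datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")
            )
    return times


def simultaneous_power_loss(node_a_log, node_b_log, tolerance_seconds=60):
    """Pair power-loss records from two nodes that occur within the tolerance.

    The 60-second default is an arbitrary allowance for clock skew between
    the two service processors, not a documented threshold.
    """
    tolerance = timedelta(seconds=tolerance_seconds)
    return [
        (a, b)
        for a in power_loss_times(node_a_log)
        for b in power_loss_times(node_b_log)
        if abs(a - b) <= tolerance
    ]
```

If `simultaneous_power_loss` returns one or more pairs for the two event logs, both nodes lost input power together, which matches the symptom described in this article. An empty result suggests that the reboots were not simultaneous.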