A400 node is unable to power on after Watchdog 2 Timer expired (OEM)
Applies to
- AFF A400
- FAS8300
- ONTAP 9
- BMC (Baseboard Management Controller)
Issue
- The node unexpectedly shuts down and the partner node takes over due to loss of heartbeat:
Tue Jun 17 17:38:08 [partner_node: kltp: clam.heartbeat.state.change:info]: Heartbeats to node (name=source_node, ID=1001) are Failing.
Tue Jun 17 17:38:19 [partner_node: cf_main: cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
Tue Jun 17 17:38:19 [partner_node: cf_main: cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
- BMC system log sel shows watchdog messages:
a7 | 06/17/2025 | 12:18:14 | Power Unit #0xb2 | Power on | Asserted | from channel 1
a8 | 06/17/2025 | 12:20:39 | Watchdog 2 #0xb1 | Timer expired (OEM) | Asserted
a9 | 06/17/2025 | 12:20:45 | Power Unit #0xb2 | Power on | Asserted | from channel 1
aa | 06/17/2025 | 12:21:10 | Power Unit #0xb2 | Power on | Asserted | from channel 1
ab | 06/17/2025 | 12:21:57 | Cli_Reboot #0xb8 | bmc cli command reboot | Asserted
ac | 01/01/2000 | 00:03:15 | Power Unit #0xb2 | Power on | Asserted | from channel 1
ad | 01/01/2000 | 00:05:39 | Watchdog 2 #0xb1 | Timer expired (OEM) | Asserted
ae | 01/01/2000 | 00:07:49 | Watchdog 2 #0xb1 | Timer expired (OEM) | Asserted
af | 01/01/2000 | 00:15:44 | Power Unit #0xb2 | Power on | Asserted | from channel 1
b0 | 01/01/2000 | 00:18:10 | Watchdog 2 #0xb1 | Timer expired (OEM) | Asserted
b1 | 01/01/2000 | 00:20:21 | Watchdog 2 #0xb1 | Timer expired (OEM) | Asserted
b2 | 01/01/2000 | 00:29:53 | Power Unit #0xb2 | Power on | Asserted | from channel 1
b3 | 01/01/2000 | 00:32:22 | Watchdog 2 #0xb1 | Timer expired (OEM) | Asserted
b4 | 01/01/2000 | 00:34:33 | Watchdog 2 #0xb1 | Timer expired (OEM) | Asserted
- Multiple sensors display No Reading status after shutdown
From BMC system log debug output:
PVCCIN_CPU0 | 01h | ns | 21.1 | No Reading
PVCCIN_CPU1 | 02h | ns | 21.1 | No Reading
PVDDQ_ABC | 03h | ns | 21.1 | No Reading
PVDDQ_DEF | 04h | ns | 21.1 | No Reading
PVDDQ_GHI | 05h | ns | 21.1 | No Reading
PVDDQ_KLM | 06h | ns | 21.1 | No Reading
P1V05_PCH | 07h | ns | 21.1 | No Reading
CX5_Temp1 | 14h | ns | 7.5 | No Reading
CX5_Temp2 | 15h | ns | 7.6 | No Reading
- The node fails to power on after executing the system power on command from the BMC prompt.
- The node does not boot even after performing a system power cycle from the BMC.
- The node fails to power on after motherboard reseat.
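
When checking a saved copy of the BMC system event log for this symptom, it can help to count how often each sensor event appears. The following is a minimal Python sketch, not a NetApp tool: it assumes the log has been saved as text with one pipe-delimited entry per line in the same column order as the SEL excerpt above. The function names parse_sel_line and count_events and the SAMPLE text (three entries copied from that excerpt) are hypothetical and exist only for illustration.

```python
from collections import Counter


def parse_sel_line(line):
    """Split one pipe-delimited SEL line into a dict; return None if it is too short."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 6:
        return None
    return {
        "id": fields[0],
        "date": fields[1],
        "time": fields[2],
        "sensor": fields[3],
        "event": fields[4],
        "state": fields[5],
    }


def count_events(sel_text):
    """Count occurrences of each (sensor, event) pair in the saved SEL text."""
    counts = Counter()
    for line in sel_text.splitlines():
        entry = parse_sel_line(line)
        if entry is not None:
            counts[(entry["sensor"], entry["event"])] += 1
    return counts


# Assumption: three entries copied from the SEL excerpt in this article.
SAMPLE = """\
a7 | 06/17/2025 | 12:18:14 | Power Unit #0xb2 | Power on | Asserted | from channel 1
a8 | 06/17/2025 | 12:20:39 | Watchdog 2 #0xb1 | Timer expired (OEM) | Asserted
a9 | 06/17/2025 | 12:20:45 | Power Unit #0xb2 | Power on | Asserted | from channel 1
"""

if __name__ == "__main__":
    for (sensor, event), total in count_events(SAMPLE).items():
        print(f"{sensor} | {event} | {total}")
```

A repeated pattern of Power on entries followed by Watchdog 2 Timer expired (OEM) entries, as in the excerpt above, is the signature described in this article. The sketch only counts entries and does not diagnose the cause.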
