AFF A400/FAS8300/FAS8700 reboot due to L2 watchdog reset
- Category: fas-systems
- Specialty: hw
- Last Updated: 11/6/2024, 1:44:39 PM
Applies to
- ONTAP 9
- AFF A400
- FAS8300
- FAS8700
Issue
- Unexpected node reboot due to L2 watchdog reset.
- ONTAP Event Management System (EMS) messages from the surviving partner node:
NOTICE cf.hwassist.takeoverTrapRecv: hw_assist: Received takeover hw_assist alert from partner(node-01), system_down because reset_via_sp.
NOTICE cf.hwassist.takeoverTrapRecv: hw_assist: Received takeover hw_assist alert from partner(node-01), system_down because l2_watchdog_reset.
OR
[node-1: cf_hwassist: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(node_name-2), system_down because power_off_via_sp.
- ONTAP PANIC message from the impacted node:
[node-2: send_boot_msg_thread: mgr.stack.string:notice]: Panic string: watchdog nmi on cpu 8, hang cpu is 1 in process idle: cpu8 on release...
- The BMC system event log (SEL) reports the watchdog NMI followed by a hard reset:
BMC> system log sel
df | 11/06/2021 | 01:58:24 | System Event #0xff | Timestamp Clock Sync | Asserted
e0 | 11/06/2021 | 02:12:53 | Watchdog 2 #0xb1 | Timer interrupt (NMI/SMS/OS) | Asserted
e1 | 11/06/2021 | 02:12:53 | Critical Interrupt #0xb0 | NMI/Diag Interrupt | Asserted
e2 | 11/06/2021 | 02:12:56 | Watchdog 2 #0xb1 | Hard reset (NMI/SMS/OS) | Asserted
e3 | 11/06/2021 | 02:12:56 | Power Unit #0xb2 | Power reset | Asserted | from channel 15
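The symptoms above can be checked mechanically when reviewing collected logs. The following is a minimal Python sketch, not a NetApp tool: the helper names (`parse_hw_assist`, `parse_sel`, `has_watchdog_reset`) are hypothetical, and the pipe-delimited SEL layout and the hw_assist message wording are assumed to match the samples shown in this article. Other ONTAP or BMC firmware versions may format these lines differently.

```python
import re

# Assumption: the alert wording matches the EMS examples in this article.
HW_ASSIST_RE = re.compile(
    r"Received takeover hw_assist alert from partner\((?P<partner>[^)]+)\), "
    r"system_down because (?P<reason>\w+)"
)


def parse_hw_assist(line):
    """Return (partner, reason) from an EMS hw_assist alert line, or None."""
    m = HW_ASSIST_RE.search(line)
    return (m.group("partner"), m.group("reason")) if m else None


def parse_sel(text):
    """Split pipe-delimited SEL lines into dicts; lines with fewer than six fields are skipped."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 6:
            continue  # e.g. the "BMC> system log sel" prompt line
        entries.append({
            "id": fields[0],
            "date": fields[1],
            "time": fields[2],
            "sensor": fields[3],
            "event": fields[4],
            "state": fields[5],
        })
    return entries


def has_watchdog_reset(entries):
    """True if a Watchdog 2 timer interrupt is later followed by a Watchdog 2 hard reset."""
    saw_timer_interrupt = False
    for e in entries:
        if not e["sensor"].startswith("Watchdog 2"):
            continue
        if e["event"].startswith("Timer interrupt"):
            saw_timer_interrupt = True
        elif e["event"].startswith("Hard reset") and saw_timer_interrupt:
            return True
    return False


# Sample data copied from the log excerpts in this article.
EMS_LINE = (
    "NOTICE cf.hwassist.takeoverTrapRecv: hw_assist: Received takeover hw_assist "
    "alert from partner(node-01), system_down because l2_watchdog_reset."
)
SEL_TEXT = """BMC> system log sel
df | 11/06/2021 | 01:58:24 | System Event #0xff | Timestamp Clock Sync | Asserted
e0 | 11/06/2021 | 02:12:53 | Watchdog 2 #0xb1 | Timer interrupt (NMI/SMS/OS) | Asserted
e1 | 11/06/2021 | 02:12:53 | Critical Interrupt #0xb0 | NMI/Diag Interrupt | Asserted
e2 | 11/06/2021 | 02:12:56 | Watchdog 2 #0xb1 | Hard reset (NMI/SMS/OS) | Asserted
e3 | 11/06/2021 | 02:12:56 | Power Unit #0xb2 | Power reset | Asserted | from channel 15
"""

if __name__ == "__main__":
    print(parse_hw_assist(EMS_LINE))
    print(has_watchdog_reset(parse_sel(SEL_TEXT)))
```

The check is deliberately narrow: it only flags the timer interrupt followed by a hard reset on the `Watchdog 2` sensor, which is the sequence shown in the SEL sample, and it says nothing about the underlying cause.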