How to handle watchdog resets (WDR)
Applies to
- ONTAP 9
- All FAS/AFF systems
- Watchdog reset reboot events
- HA Group Notification from node (REBOOT (panic)) ALERT
- PANIC : watchdog nmi on cpu
- HA Group Notification (REBOOT (watchdog reset)) ALERT
Description
What is a watchdog reset?
A watchdog is an independent timer that monitors the progress of the main controller running ONTAP.
- Its function is to restart the system automatically if the system encounters an unrecoverable error.
- The watchdog implemented by NetApp uses a two-level timer, with a different action associated with each level.
- The Level 1/Level 2 Watchdog operation is not proprietary to NetApp and is used throughout the hardware industry.
Level Type | Description |
Level 1: Timeout | The storage appliance attempts to panic and dump the core in response to a non-maskable interrupt (NMI). |
Level 2: Reset | The storage appliance resets through a hard reset signal sent from the timer. |
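The two-level behavior described above can be illustrated with a small model. This is a minimal sketch, not NetApp's implementation: the class name, the method names, and both timeout values are hypothetical and chosen only to show how one idle timer can trigger two different actions.

```python
from dataclasses import dataclass


@dataclass
class TwoLevelWatchdog:
    # Both timeouts are hypothetical values chosen for illustration only;
    # they are not the values used by any real controller.
    level1_timeout: float = 5.0   # idle seconds before the Level 1 action
    level2_timeout: float = 10.0  # idle seconds before the Level 2 action
    last_heartbeat: float = 0.0

    def heartbeat(self, now: float) -> None:
        """Record that the monitored system made progress at time `now`."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Return the action the watchdog would take at time `now`."""
        idle = now - self.last_heartbeat
        if idle >= self.level2_timeout:
            # Level 2: the timer sends a hard reset signal.
            return "LEVEL2_RESET"
        if idle >= self.level1_timeout:
            # Level 1: an NMI is raised so the system can try to panic
            # and dump core.
            return "LEVEL1_NMI"
        return "OK"
```

In this model, a system that keeps calling `heartbeat()` never leaves the `OK` state. A system that stops making progress first receives the Level 1 action and, if it still does not recover, the Level 2 action.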
How to Identify a Watchdog NMI Panic
Run the following command from the cluster shell to list panic-related events:
event log show -severity * -message-name panic*
- Up node that performed the takeover
Fri Nov 18 01:20:54 -0600 [NetApp01: cf_main: cf.fsm.takeover.panic:alert]: Failover monitor: takeover attempted after partner panic.
Fri Nov 18 01:21:37 -0600 [NetApp01: cf_main: callhome.sfo.takeover.panic:EMERGENCY]: Call home for CONTROLLER TAKEOVER COMPLETE PANIC
- Partner node, after it reboots
Fri Nov 18 01:42:35 -0600 [NetApp02: splog_main: mgr.stack.string:notice]: Panic string: watchdog nmi on cpu 3, hang cpu is 0 in SK process wafl_scan_exempt on release 9.9.1P7 (C)
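When collecting data for a support case, it can help to pull the key fields out of the panic string. The following is a minimal sketch that assumes the panic string follows the same layout as the sample line above; the function name, the regular expression, and the field names (`nmi_cpu`, `hang_cpu`, `context`, `process`, `release`) are hypothetical and are not part of ONTAP.

```python
import re
from typing import Optional

# Pattern inferred from the single sample line in this article; other
# releases may format the panic string differently.
PANIC_RE = re.compile(
    r"Panic string: watchdog nmi on cpu (?P<nmi_cpu>\d+), "
    r"hang cpu is (?P<hang_cpu>\d+) in (?P<context>\S+) process "
    r"(?P<process>\S+) on release (?P<release>\S+)"
)


def parse_watchdog_panic(line: str) -> Optional[dict]:
    """Return the panic fields as a dict, or None if the line does not match."""
    match = PANIC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the CPU numbers from strings to integers.
    fields["nmi_cpu"] = int(fields["nmi_cpu"])
    fields["hang_cpu"] = int(fields["hang_cpu"])
    return fields


sample = (
    "Fri Nov 18 01:42:35 -0600 [NetApp02: splog_main: mgr.stack.string:notice]: "
    "Panic string: watchdog nmi on cpu 3, hang cpu is 0 in SK process "
    "wafl_scan_exempt on release 9.9.1P7 (C)"
)
result = parse_watchdog_panic(sample)
```

For the sample line, `result` contains the CPU that received the NMI, the CPU reported as hung, the process name, and the ONTAP release. Lines that are not watchdog NMI panics return `None`.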
Additional Information
For further assistance, contact NetApp Technical Support and reference this article along with the data collected.