What is a watchdog reset?
A watchdog is an independent timer that monitors the progress of the main controller running Data ONTAP. Its function is to restart the system automatically if the system encounters an unrecoverable error.
The watchdog implemented by NetApp uses a two-level timer with different actions associated with each level of time.
- Level 1: Timeout: The storage appliance attempts to panic and dump core in response to a non-maskable interrupt. An L1 watchdog is issued if the timer is not reset within 1.5 seconds. Once an L1 watchdog is successfully issued, a core file is written and the system returns to service, allowing NetApp to determine the root cause of the hang.
- Level 2: Reset: The storage appliance resets through a hard reset signal sent from the timer. An L2 watchdog is issued if the watchdog timer is not reset within two seconds after the L1 watchdog. The L2 watchdog does not generate a core dump.
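The two levels described above can be sketched as a small timing model. This is only an illustration of the thresholds stated in this article, not NetApp's implementation: the names `WatchdogAction` and `watchdog_action` are hypothetical, and treating a silence of exactly 1.5 seconds as a timeout is an assumption.

```python
from enum import Enum

# Thresholds taken from the text above.
L1_TIMEOUT_S = 1.5    # L1 is issued if the timer is not reset within 1.5 s
L2_AFTER_L1_S = 2.0   # L2 is issued if still not reset 2 s after the L1


class WatchdogAction(Enum):
    """Hypothetical labels for the outcome at each level."""
    NONE = "timer reset in time, no action"
    L1_TIMEOUT = "L1 timeout: NMI, panic and core dump attempted"
    L2_RESET = "L2 reset: hard reset, no core dump"


def watchdog_action(seconds_since_last_reset: float) -> WatchdogAction:
    """Return the highest watchdog level reached after the given silence.

    Assumption: the boundaries are inclusive (>=); the article does not
    say what happens at exactly 1.5 s or 3.5 s.
    """
    if seconds_since_last_reset >= L1_TIMEOUT_S + L2_AFTER_L1_S:
        return WatchdogAction.L2_RESET
    if seconds_since_last_reset >= L1_TIMEOUT_S:
        return WatchdogAction.L1_TIMEOUT
    return WatchdogAction.NONE


if __name__ == "__main__":
    for silence in (1.0, 2.0, 4.0):
        print(f"{silence:.1f} s without a reset -> {watchdog_action(silence).name}")
```

In this model an L2 reset is reached only when the timer stays unserviced for 3.5 seconds in total, that is, when the L1 panic did not manage to bring the system back.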
It is not necessary to ‘recover’ from a watchdog timeout or watchdog reset, as both of these events are recovery mechanisms for other failures. The objective instead is to identify the failure(s) that caused the watchdog event.
What is the appropriate response to a watchdog timeout (L1 Watchdog Event)?
A watchdog timeout should be treated just like any other system panic. The associated backtrace and/or the core should be analyzed for the possible root cause(s). A giveback should be performed if necessary.
What is the appropriate response to a watchdog reset (L2 Watchdog Event)?
DO NOT SIMPLY GIVEBACK AND MONITOR, as data collection is required.
Please collect the following data to help diagnose the cause of a watchdog reset:
- AutoSupport messages
- Console logs before, during, and after the watchdog event (if possible)
- ssram log (/etc/log/ssram/ssram.log or /mroot/etc/log/ssram/ssram.log) - FAS62xx, FAS80x0 only
- On systems with a service processor, the output of the following commands:
  - system sensors
  - system log
  - events all
  - sp status -d
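As a convenience, the collection list above could be captured in a short checklist helper. The sketch below only builds a list of items to gather; it does not connect to a system or run any command. The function name `l2_collection_checklist` and its two boolean parameters are hypothetical, and the item text is copied from this article.

```python
from typing import List

# Items and command names are copied from the list above.
GENERAL_ITEMS = [
    "AutoSupport messages",
    "Console logs before, during, and after the watchdog event (if possible)",
]
SSRAM_ITEM = "ssram log (/etc/log/ssram/ssram.log or /mroot/etc/log/ssram/ssram.log)"
SP_COMMANDS = ["system sensors", "system log", "events all", "sp status -d"]


def l2_collection_checklist(has_ssram_log: bool, has_service_processor: bool) -> List[str]:
    """Build the list of data to collect after a watchdog reset (L2).

    has_ssram_log: True for the platforms the article names for the
        ssram log (FAS62xx, FAS80x0).
    has_service_processor: True if the system has a service processor.
    """
    items = list(GENERAL_ITEMS)
    if has_ssram_log:
        items.append(SSRAM_ITEM)
    if has_service_processor:
        items.extend(f"SP output: {cmd}" for cmd in SP_COMMANDS)
    return items


if __name__ == "__main__":
    for item in l2_collection_checklist(has_ssram_log=True, has_service_processor=True):
        print("[ ]", item)
```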
Note: No hardware should be replaced unless the root cause is a hardware issue based on the available log analysis.
For further assistance, contact NetApp Technical Support and reference this article along with the data collected.