Handling watchdog resets (WDR)
Applies to
Watchdog reset
Answer
What is a watchdog reset?
A watchdog is an independent timer that monitors the progress of the main controller running Data ONTAP. Its function is to restart the system automatically if the system encounters an unrecoverable error.
The watchdog implemented by NetApp uses a two-level timer with different actions associated with each level of time.
- Level 1: Timeout: The storage appliance attempts to panic and dump the core in response to a non-maskable interrupt. Once an L1 watchdog is successfully issued, the system returns to service and a core file is written, allowing NetApp to determine the root cause of the hang. An L1 watchdog is issued if the timer is not reset within 1.5 seconds.
- Level 2: Reset: The storage appliance resets through a hard reset signal sent from the timer. An L2 watchdog is issued if the watchdog timer is not reset within two seconds after the L1 watchdog. The L2 watchdog does not generate a core dump.
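The two timer levels above can be illustrated with a minimal sketch. This is a simplified model for reasoning about the thresholds, not NetApp's implementation: the names `WatchdogState` and `watchdog_state` are hypothetical, the model assumes the timer is never serviced once it starts running late, and treating a value exactly on a threshold as "fired" is an assumption.

```python
from enum import Enum

# Thresholds taken from the description above.
L1_TIMEOUT_S = 1.5  # L1 is issued if the timer is not reset within 1.5 seconds
L2_EXTRA_S = 2.0    # L2 is issued two seconds after the L1 watchdog


class WatchdogState(Enum):
    """Hypothetical labels for the three outcomes in this model."""
    OK = "ok"
    L1_TIMEOUT = "l1_timeout"  # panic and core dump attempted via NMI
    L2_RESET = "l2_reset"      # hard reset, no core dump


def watchdog_state(seconds_since_last_reset: float) -> WatchdogState:
    """Classify elapsed time since the timer was last reset.

    Assumes the timer is not reset at any point during the interval.
    """
    if seconds_since_last_reset < L1_TIMEOUT_S:
        return WatchdogState.OK
    if seconds_since_last_reset < L1_TIMEOUT_S + L2_EXTRA_S:
        return WatchdogState.L1_TIMEOUT
    return WatchdogState.L2_RESET


if __name__ == "__main__":
    for elapsed in (1.0, 2.0, 4.0):
        print(elapsed, watchdog_state(elapsed).value)
```

In this model an L2 reset corresponds to roughly 3.5 seconds without a timer reset, which is why an L2 event implies that the L1 panic attempt did not complete.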
It is not necessary to ‘recover’ from a watchdog timeout or watchdog reset, as both of these events are recovery mechanisms for other failures. The objective instead is to identify the failure(s) that caused the watchdog event.
What is the appropriate response to a watchdog timeout (L1 Watchdog Event)?
A watchdog timeout should be treated just like any other system panic. The associated backtrace and/or the core should be analyzed for the possible root cause(s). A giveback should be performed if necessary.
What is the appropriate response to a watchdog reset (L2 Watchdog Event)?
DO NOT SIMPLY GIVEBACK AND MONITOR, as data collection is required.
Please collect the following data to help diagnose the cause of a watchdog reset:
- AutoSupport messages
- Console logs before, during, and after the watchdog event (if possible)
- ssram log (/etc/log/ssram/ssram.log or /mroot/etc/log/ssram/ssram.log) - FAS62xx, FAS80x0 only
- SP and BMC logs (required for proper analysis)
Note: No hardware should be replaced unless the root cause is a hardware issue based on the available log analysis.
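As a rough aid, the collection list above can be expressed as a per-platform checklist. The sketch below is illustrative only: the helper name `l2_collection_checklist` is hypothetical, and reading "FAS62xx" as `FAS62` followed by two digits and "FAS80x0" as `FAS80`, one digit, then `0` is an assumption about how model names are written.

```python
import re
from typing import List

# Items that apply to every platform, per the list above.
COMMON_ARTIFACTS = [
    "AutoSupport messages",
    "Console logs before, during, and after the watchdog event (if possible)",
    "SP and BMC logs",
]

# Only requested on FAS62xx and FAS80x0 systems.
SSRAM_LOG = "ssram log (/etc/log/ssram/ssram.log or /mroot/etc/log/ssram/ssram.log)"

# Assumed spelling of the model names; adjust if your naming differs.
_SSRAM_PLATFORMS = re.compile(r"^FAS(62\d\d|80\d0)$", re.IGNORECASE)


def l2_collection_checklist(platform: str) -> List[str]:
    """Return the data to collect after an L2 watchdog reset on `platform`."""
    items = list(COMMON_ARTIFACTS)
    if _SSRAM_PLATFORMS.match(platform.strip()):
        # Keep the same order as the written list: ssram log before SP/BMC logs.
        items.insert(2, SSRAM_LOG)
    return items


if __name__ == "__main__":
    for item in l2_collection_checklist("FAS8040"):
        print("-", item)
```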
Platform | Article |
---|---|
AFF A80X0, FAS80X0 | Handling L2 Watchdog Resets on the FAS 80X0 and AFF A80X0 platforms |
FAS25XX | Handling L2 Watchdog Resets on the FAS 25XX platforms |
AFF A700, FAS9000 | Handling L2 Watchdog Resets on the AFF A700 and FAS9000 platforms |
AFF A200, FAS26XX | Handling L2 Watchdog Resets on the FAS26XX and AFF A200 platforms |
AFF A220, AFF C190, FAS27XX | Handling L2 Watchdog Resets on the FAS27XX, AFF A220, and AFF C190 platforms |
AFF A400, FAS8300, FAS8700 | Handling L2 Watchdog Resets on the AFF A400, FAS8300, and FAS8700 |
AFF A700s | Handling L2 Watchdog Resets on the AFF A700s Platform |
AFF A300, FAS8200 | Handling L2 Watchdog Resets on the FAS8200 and AFF A300 platforms |
AFF A800 | Handling L2 Watchdog Resets on the AFF A800 Platform |
AFF A320 | Handling L2 Watchdog Resets on the AFF A320 Platform |
AFF A900, FAS9500 | Handling L2 Watchdog Resets on the AFF A900 and FAS9500 Platform |
AFF A250, FAS500f, AFF C250 | Handling L2 Watchdog Resets on the AFF A250, FAS500f, and AFF C250 |
Additional Information
For further assistance, contact NetApp Technical Support and reference this article along with the data collected.