How to handle watchdog resets (WDR)

Applies to

ONTAP 9
All FAS/AFF systems
Watchdog reset reboot events
HA Group Notification from node (REBOOT (panic)) ALERT
- PANIC : watchdog nmi on cpu
HA Group Notification (REBOOT (watchdog reset)) ALERT

Description

What is a watchdog reset?

A watchdog reset is a mechanism used by most computer systems to automatically restart the system if it encounters an unrecoverable error or becomes unresponsive.

The watchdog timer concept is widely used across the computer hardware and networking industries, not just by one company. It is standard practice because it helps systems recover from unexpected problems without human intervention.

When a watchdog reset occurs, it is vital to understand the cause of the restart using the procedures listed below. No hardware should be replaced unless log analysis shows that the root cause is a hardware issue.

The function of a watchdog reset is to automatically restart the server when the system encounters an unrecoverable system error.
The watchdog implemented by NetApp uses a two-level timer with different actions associated with each level of time.
The Level 1/Level 2 Watchdog operation is not proprietary to NetApp and is used throughout the hardware industry.

Level Type	Description
Level 1: Timeout	The storage appliance attempts to panic and dump the core in response to a non-maskable interrupt. An L1 watchdog is issued if the timer is not reset within 1.5 seconds. Once an L1 watchdog is successfully issued, the system returns to service and a core file is written, allowing NetApp to determine the root cause of the hang.
Level 2: Reset	The storage appliance resets through a hard reset signal sent from the timer. An L2 watchdog is issued if the watchdog timer is not reset within two seconds after the L1 watchdog. The L2 watchdog does not generate a core dump.

It is not necessary to ‘recover’ from a watchdog timeout or watchdog reset, as both of these events are recovery mechanisms for other failures.
- The objective instead is to identify the failure(s) that caused the watchdog event.
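
The two-level behavior described above can be sketched as a small classification function. This is only an illustrative model, not NetApp's implementation: the names `watchdog_state`, `WatchdogState`, `L1_TIMEOUT_S`, and `L2_DELAY_S` are hypothetical, and the sketch assumes the two-second L2 window starts at the moment the L1 watchdog fires (that is, 3.5 seconds after the last timer reset).

```python
from enum import Enum

# Values taken from the article text; the constant names are hypothetical.
L1_TIMEOUT_S = 1.5  # L1 is issued if the timer is not reset within 1.5 s
L2_DELAY_S = 2.0    # L2 is issued if still not reset 2 s after L1 (assumed reading)


class WatchdogState(Enum):
    OK = "ok"                  # timer was reset in time, no action
    L1_TIMEOUT = "l1_timeout"  # NMI: panic and attempt a core dump
    L2_RESET = "l2_reset"      # hard reset, no core dump


def watchdog_state(seconds_since_last_reset: float) -> WatchdogState:
    """Classify which watchdog level would have fired after a given stall."""
    if seconds_since_last_reset < 0:
        raise ValueError("elapsed time cannot be negative")
    if seconds_since_last_reset < L1_TIMEOUT_S:
        return WatchdogState.OK
    if seconds_since_last_reset < L1_TIMEOUT_S + L2_DELAY_S:
        return WatchdogState.L1_TIMEOUT
    return WatchdogState.L2_RESET
```

Under these assumptions, a stall of 2 seconds would fall in the L1 window (a core file is expected), while a stall of 4 seconds would end in an L2 reset with no core dump to analyze.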

How to Identify a Watchdog NMI Panic

The panic message can be found in the ONTAP event logs, as shown below, or in the SP/BMC "system log" command output.

::> event log show -severity * -message-name panic*

Up node that performed the takeover

Fri Nov 18 01:20:54 -0600 [NetApp01: cf_main: cf.fsm.takeover.panic:alert]: Failover monitor: takeover attempted after partner panic.
Fri Nov 18 01:21:37 -0600 [NetApp01: cf_main: callhome.sfo.takeover.panic:EMERGENCY]: Call home for CONTROLLER TAKEOVER COMPLETE PANIC

Partner node, following reboot

Fri Nov 18 01:42:35 -0600 [NetApp02: splog_main: mgr.stack.string:notice]: Panic string: watchdog nmi on cpu 3, hang cpu is 0 in SK process wafl_scan_exempt on release 9.9.1P7 (C)
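
When reviewing many log lines, it can help to pull the key fields out of the panic string programmatically. The following is a minimal sketch that assumes the panic string follows the same layout as the sample above; other releases or panic types may format it differently. The function name `parse_watchdog_panic` and the returned field names are hypothetical and are not part of any NetApp tool.

```python
import re
from typing import Optional

# Pattern modeled on the sample line in this article (an assumption, not a
# documented format): "watchdog nmi on cpu N, hang cpu is M in SK process P
# on release R".
PANIC_RE = re.compile(
    r"watchdog nmi on cpu (?P<nmi_cpu>\d+), hang cpu is (?P<hang_cpu>\d+)"
    r" in SK process (?P<process>\S+) on release (?P<release>\S+)"
)


def parse_watchdog_panic(line: str) -> Optional[dict]:
    """Return the fields of a watchdog NMI panic string, or None if no match."""
    match = PANIC_RE.search(line)
    if match is None:
        return None
    return {
        "nmi_cpu": int(match["nmi_cpu"]),    # CPU that reported the NMI
        "hang_cpu": int(match["hang_cpu"]),  # CPU identified as hung
        "process": match["process"],         # process named in the panic string
        "release": match["release"],         # ONTAP release string
    }


sample = (
    "Fri Nov 18 01:42:35 -0600 [NetApp02: splog_main: mgr.stack.string:notice]: "
    "Panic string: watchdog nmi on cpu 3, hang cpu is 0 in SK process "
    "wafl_scan_exempt on release 9.9.1P7 (C)"
)
fields = parse_watchdog_panic(sample)
```

For the sample line, `fields` would contain the hang CPU (0), the process name (`wafl_scan_exempt`), and the release (`9.9.1P7`), which are the details to include when opening a support case.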

Platform	Article
AFF A50, AFF A30, AFF A20, AFF C60, AFF C30, ASA A50, ASA A30, ASA A20, ASA C30, FAS50	Handling Watchdog Resets on the AFF A20 / AFF A30 / AFF C30 / AFF A50 / AFF C60 / FAS50
AFF A1K, AFF A90, AFF A70, AFF C80, AFX 1K, ASA A1K, ASA A90, ASA A70, FAS90, FAS70	Handling L2 Watchdog Resets on the FAS90 / FAS70 / ASA A1K / ASA A90 / ASA A70 / AFF A1K / AFF A90 / AFF A70 / AFF C80
FAS2820	Handling L2 Watchdog Resets on the FAS2820
AFF A900, ASA A900, FAS9500	Handling L2 Watchdog Resets on the AFF A900 and FAS9500 Platform
AFF A250, AFF C250, ASA A250, ASA C250, FAS500f	Handling L2 Watchdog Resets on the AFF A250 / FAS500f / AFF C250
AFF A400, AFF C400, ASA A400, ASA C400, FAS8300, FAS8700	Handling L2 Watchdog Resets on the AFF A400 / AFF C400 / FAS8700 / FAS8300
AFF A320	Handling L2 Watchdog Resets on the AFF A320 Platform
AFF A800, AFF C800, ASA A800, ASA C800	Handling L2 Watchdog Resets on the AFF A800 and AFF C800 Platform
AFF A700s	Handling L2 Watchdog Resets on the AFF A700s Platform
AFF A300, AFF A220, AFF A150, AFF C190, ASA A150, FAS8200, FAS2750, FAS2720	Handling L2 Watchdog Resets on the AFF A300 / AFF A220 / AFF A150 / AFF C190 / ASA A150 / FAS8200 / FAS2750 / FAS2720
AFF A700, FAS9000	Handling L2 Watchdog Resets on the AFF A700 and FAS9000 platforms
FAS8020, FAS8040, FAS8060, FAS8080, AFF8020, AFF8040, AFF8060, AFF8080	Handling L2 Watchdog Resets on the FAS8020 / FAS8040 / FAS8060 / FAS8080 / AFF8020 / AFF8040 / AFF8060 / AFF8080
FAS2620, FAS2650, AFF A200	Handling L2 Watchdog Resets on the FAS2620 / FAS2650 / AFF A200
FAS2520, FAS2552, FAS2554	Handling L2 Watchdog Resets on the FAS2520 / FAS2552 / FAS2554
FAS3250	Handling L2 Watchdog Resets on the FAS3250

Additional Information

For further assistance, contact NetApp Technical Support and reference this article along with the data collected.