NetApp Knowledge Base

How to handle watchdog resets (WDR)

Applies to

  • ONTAP 9
  • All FAS/AFF systems
  • Watchdog reset reboot events
  • HA Group Notification from node (REBOOT (panic)) ALERT
    • PANIC  : watchdog nmi on cpu
  • HA Group Notification (REBOOT (watchdog reset)) ALERT

Description

What is a watchdog reset?

A watchdog is an independent timer that monitors the progress of the main controller running ONTAP.

  • Its function is to restart the system automatically if ONTAP encounters an unrecoverable error.
  • The watchdog implemented by NetApp uses a two-level timer with different actions associated with each level of time.
  • The Level 1/Level 2 Watchdog operation is not proprietary to NetApp and is used throughout the hardware industry.

 

The two levels are:

Level 1: Timeout

The storage appliance attempts to panic and dump the core in response to a non-maskable interrupt (NMI).

  • An L1 watchdog is issued if the timer is not reset within 1.5 seconds.
  • Once an L1 watchdog is successfully issued, the system returns to service and writes a core file, which allows NetApp to determine the root cause of the hang.

Level 2: Reset

The storage appliance resets through a hard reset signal sent from the timer.

  • An L2 watchdog is issued if the watchdog timer is not reset within two seconds after the L1 watchdog.
  • The L2 watchdog does not generate a core dump.

  • It is not necessary to ‘recover’ from a watchdog timeout or watchdog reset, as both of these events are recovery mechanisms for other failures.
    • The objective instead is to identify the failure(s) that caused the watchdog event.

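
As a rough illustration of the timing described above, the following Python sketch classifies how long the watchdog timer has gone without being reset. It is a simplified model, not NetApp's implementation: the function name `watchdog_level` and the constant names are hypothetical, only the 1.5 second and two second thresholds come from the text above, and treating the exact boundary values as already expired is an assumption.

```python
# Simplified model of the two-level watchdog timing described in this article.
# Names are hypothetical; only the two thresholds come from the article.
L1_TIMEOUT_S = 1.5  # L1 is issued if the timer is not reset within 1.5 seconds
L2_EXTRA_S = 2.0    # L2 is issued if the timer is still not reset 2 seconds after L1


def watchdog_level(seconds_since_reset: float) -> str:
    """Return "ok", "L1" or "L2" for a given time since the last timer reset."""
    if seconds_since_reset < L1_TIMEOUT_S:
        return "ok"  # timer was reset in time, no watchdog event
    if seconds_since_reset < L1_TIMEOUT_S + L2_EXTRA_S:
        return "L1"  # timeout: NMI, panic and core dump are attempted
    return "L2"      # reset: hard reset signal, no core dump


if __name__ == "__main__":
    for elapsed in (1.0, 2.0, 4.0):
        print(elapsed, watchdog_level(elapsed))
```

In this model an L2 reset is reached only when the system stays unresponsive for the whole L1 window as well, which matches the statement that L2 follows an L1 that did not restore progress.
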
How to Identify a Watchdog NMI Panic

Search the event log for panic messages:

event log show -severity * -message-name panic*

  • Surviving node that performed the takeover

Fri Nov 18 01:20:54 -0600 [NetApp01: cf_main: cf.fsm.takeover.panic:alert]: Failover monitor: takeover attempted after partner panic.
Fri Nov 18 01:21:37 -0600 [NetApp01: cf_main: callhome.sfo.takeover.panic:EMERGENCY]: Call home for CONTROLLER TAKEOVER COMPLETE PANIC

  • Partner node, following reboot

Fri Nov 18 01:42:35 -0600 [NetApp02: splog_main: mgr.stack.string:notice]: Panic string: watchdog nmi on cpu 3, hang cpu is 0 in SK process wafl_scan_exempt on release 9.9.1P7 (C)
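
When many log lines need to be reviewed, the fields of the panic string can be pulled out with a small script. The Python sketch below is one possible approach. It assumes the panic string always follows the layout of the single sample line above, which may not hold for other releases or panic types. The function name `parse_watchdog_panic` and the field names are hypothetical.

```python
import re

# Pattern derived only from the sample panic string in this article;
# other ONTAP releases or panic types may format the string differently.
PANIC_RE = re.compile(
    r"watchdog nmi on cpu (?P<nmi_cpu>\d+), hang cpu is (?P<hang_cpu>\d+) "
    r"in SK process (?P<process>\S+) on release (?P<release>\S+)"
)


def parse_watchdog_panic(line):
    """Return the panic fields as a dict, or None if the line does not match."""
    match = PANIC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["nmi_cpu"] = int(fields["nmi_cpu"])
    fields["hang_cpu"] = int(fields["hang_cpu"])
    return fields


if __name__ == "__main__":
    sample = (
        "Fri Nov 18 01:42:35 -0600 [NetApp02: splog_main: mgr.stack.string:notice]: "
        "Panic string: watchdog nmi on cpu 3, hang cpu is 0 in SK process "
        "wafl_scan_exempt on release 9.9.1P7 (C)"
    )
    print(parse_watchdog_panic(sample))
```

The hung CPU and process name are the fields most likely to be useful when the data is passed to NetApp Technical Support.
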

Platform Article

FAS8020 / FAS8040 / FAS8060 / FAS8080 / AFF8020 / AFF8040 / AFF8060 / AFF8080

Handling L2 Watchdog Resets on the FAS8020 / FAS8040 / FAS8060 / FAS8080 / AFF8020 / AFF8040 / AFF8060 / AFF8080

FAS2520 / FAS2552 / FAS2554

Handling L2 Watchdog Resets on the FAS2520 / FAS2552 / FAS2554

AFF A700 / FAS9000

Handling L2 Watchdog Resets on the AFF A700 and FAS9000 platforms

FAS2620 / FAS2650 / AFF A200

Handling L2 Watchdog Resets on the FAS2620 / FAS2650 / AFF A200

AFF A220 / AFF A150 / AFF C190 / FAS2750 / FAS2720

Handling L2 Watchdog Resets on the AFF A220 / AFF A150 / AFF C190 / FAS2750 / FAS2720

AFF A400 / AFF C400 / FAS8700 / FAS8300

Handling L2 Watchdog Resets on the AFF A400 / AFF C400 / FAS8700 / FAS8300

AFF A700s

Handling L2 Watchdog Resets on the AFF A700s Platform

AFF A300 / FAS8200

Handling L2 Watchdog Resets on the FAS8200 and AFF A300 platforms

AFF A800 / AFF C800

Handling L2 Watchdog Resets on the AFF A800 and AFF C800 Platform

AFF A320

Handling L2 Watchdog Resets on the AFF A320 Platform

AFF A900 / FAS9500

Handling L2 Watchdog Resets on the AFF A900 and FAS9500 Platform

AFF A250 / FAS500f / AFF C250

Handling L2 Watchdog Resets on the AFF A250 / FAS500f / AFF C250

Additional Information

For further assistance, contact NetApp Technical Support and reference this article along with the data collected.

 
