StorageGRID appliance SG5700 reboots unexpectedly due to watchdog timeout

Last updated

Apr 25, 2025

Applies to

NetApp StorageGRID 11.5 and above
StorageGRID Appliances SG5700

Issue

An alert email reports an unexpected reboot of one or more nodes:

Unexpected node reboot (1 alert) 

A node rebooted unexpectedly within the last 24 hours.

Recommended actions

1. Monitor this alert. The alert will be cleared after 24 hours. However, if the node reboots unexpectedly again, this alert will be triggered again.

2. If you cannot resolve the alert, there might be a hardware failure. Contact technical support.

________________________________________

dc1-sn1 

Node    dc1-sn1

Site    DC1

Severity    Major

Time triggered    WKD MMM DD hh:mm:ss UTC YYYY

Job    miscd

High CPU utilization is reported on the affected node in the period leading up to the reboot. CPU usage can be viewed in the Nodes view of StorageGRID Grid Manager, and in the Node or Node (Internal Use) views under Support > Metrics.
The node rebooted due to a watchdog timeout. To confirm this:
- Collect log files and system data from the affected nodes
- Extract the .tar.gz log file
- Locate the crash dmesg file: base-os-logs\run\mount-tmp\pge-actv-root\var\log\storagegrid_crash_dmesg.YYYYMMDDhhmmss.log.gz
- Extract and open the crash dmesg file, and verify that the node rebooted due to a watchdog timeout:

[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr fired, interruptCount = 1
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr logging memory usage ...
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr logging CPU backtrace ...
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr logging blocked tasks ...
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr logging all ftrace buffers ...
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr serviced watchdog
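
The manual check above can also be scripted. The following is a minimal Python sketch, not a NetApp-provided tool. It assumes the collected log archive is a .tar.gz file, that the crash dmesg file inside it follows the storagegrid_crash_dmesg.YYYYMMDDhhmmss.log.gz naming shown above, and that the "fpga_pci: fpgaIsr:" messages appear as in the excerpt. All function names are hypothetical and chosen for illustration only.

```python
import gzip
import re
import tarfile

# File name pattern taken from the path in this article (14-digit timestamp).
CRASH_DMESG_NAME = re.compile(r"storagegrid_crash_dmesg\.\d{14}\.log\.gz$")

# Message prefix taken from the crash dmesg excerpt in this article.
# Captures the text after "fpgaIsr " up to the next "[" or end of line.
FPGA_ISR_PATTERN = re.compile(r"fpga_pci: fpgaIsr: fpgaIsr ([^\[\n]+)")


def find_crash_dmesg_members(archive_path):
    """Return archive member names that look like crash dmesg files."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [name for name in archive.getnames()
                if CRASH_DMESG_NAME.search(name)]


def read_crash_dmesg(archive_path, member_name):
    """Decompress one crash dmesg member and return its contents as text."""
    with tarfile.open(archive_path, "r:gz") as archive:
        member = archive.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file")
        with member:
            raw = gzip.decompress(member.read())
    return raw.decode("utf-8", errors="replace")


def fpga_isr_events(text):
    """Return the fpgaIsr messages found in the dmesg text, in order."""
    return [match.group(1).strip()
            for match in FPGA_ISR_PATTERN.finditer(text)]


def watchdog_serviced(text):
    """True if the dmesg text contains the 'serviced watchdog' message."""
    return "serviced watchdog" in fpga_isr_events(text)
```

To use the sketch, call find_crash_dmesg_members() with the path to the collected archive, pass each returned member name to read_crash_dmesg(), and then check watchdog_serviced() on the resulting text. A True result only indicates that the log contains the same fpgaIsr messages as the excerpt above. It does not replace a review of the full crash dmesg file.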