StorageGRID appliance SG5700 reboots unexpectedly due to watchdog timeout
Applies to
- NetApp StorageGRID 11.5 and above
- StorageGRID Appliance SG5712
Issue
- Alert email of unexpected reboot for one or more nodes:
Unexpected node reboot (1 alert)
A node rebooted unexpectedly within the last 24 hours.
Recommended actions
1. Monitor this alert. The alert will be cleared after 24 hours. However, if the node reboots unexpectedly again, this alert will be triggered again.
2. If you cannot resolve the alert, there might be a hardware failure. Contact technical support.
________________________________________
dc1-sn1
Node dc1-sn1
Site DC1
Severity Major
Time triggered WKD MMM DD hh:mm:ss UTC YYYY
Job miscd
- The node is rebooted due to watchdog timeout:
- Collecting log files and system data on affected nodes
- Extract the .tar.gz log file
- Locate the crash dmesg file:
base-os-logs\run\mount-tmp\pge-actv-root\var\log\storagegrid_crash_dmesg.YYYYMMDDhhmmss.log.gz
- Extract and open the crash dmesg file, verify the node is rebooted due to watchdog timeout:
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr fired, interruptCount = 1
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr logging memory usage
...
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr logging CPU backtrace
...
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr logging blocked tasks
...
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr logging all ftrace buffers
...
[sss.uuuuuu] fpga_pci: fpgaIsr: fpgaIsr serviced watchdog