NetApp Knowledge Base

StorageGRID Node in Unknown State Due to Machine Check Exception and Kernel Panic

Category: storagegrid
Specialty: sgrid

Applies to

  • NetApp StorageGRID Appliance
  • StorageGRID 11.x

Issue

  • A StorageGRID node becomes completely unresponsive and enters the “Unknown” state.
  • The node is unreachable over both the Grid and Admin networks, cannot be pinged, and does not respond to remote reboot attempts. Console logs show repeated machine check exceptions and kernel panics.
  • The serial console shows the following events:

APIC 0 microcode 7000013
[   14.128696] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 880000420 0800091
[   14.136413] mce: [Hardware Error]: TSC 41efd53e80 MISC 4900040804080400 PPIN 5593508c1574f86
[   14.144898] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1769540988 SOCKET 0 APIC 0 microcode 7000013
[   14.154177] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 880000420 0800091
[   14.161893] mce: [Hardware Error]: TSC 41efd5b48c MISC 4908525c125c0400 PPIN 5593508c1574f86
[   14.170388] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1769540988 SOCKET 0 APIC 0 microcode 7000013
[   14.179696] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: c80000820 0800091
[   14.179696] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: c80000820 0800091
[   14.187163] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 7: fe00004400010091
[   19.396268] Kernel panic - not syncing: Panicing machine check CPU died
[   20.430312] Shutting down cpus with NMI
[   20.430321] Kernel Offset: 0x6800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xfffffffffffbffff)
-----------------------------------------------------------------------------------------------------------------

SG login: [  477.304189] mce: [Hardware Error]: CPU 8: Machine Check Exception: 5 Bank 7: fe000b8000010091
[  477.312677] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f118daf065d>
[  477.319258] mce: [Hardware Error]: TSC 16d447eeff9 ADDR 3301474c0 MISC 4022a286 PPIN 5593508c1574f86
[  477.328434] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1769537008 SOCKET 0 APIC 1 microcode 7000013
[  477.337692] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[  477.346397] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 7: fe000bc000010091
[  477.354878] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff893e3d08> {mwait_idle_with_hints.constprop.0+0x48/0x90}
[  477.365352] mce: [Hardware Error]: TSC 16d447ef69c ADDR 3301474c0 MISC 4022a286 PPIN 5593508c1574f86
[  477.374527] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1769537008 SOCKET 0 APIC 0 microcode 7000013
[  477.383785] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[  477.392475] mce: [Hardware Error]: Machine check: Processor context corrupt
[  477.399399] Kernel panic - not syncing: Fatal machine check

  • Lumberjack log collection fails for the affected node and reports "Error": "Unable to connect to ABC-NODE on port 22: No route to host - connect(2) for x.x.x.x"
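
When a serial console capture is long, it can help to pull out just the fatal machine check lines and the panic reasons. The following is a minimal Python sketch, not a NetApp tool: the function names `decode_status` and `summarize_console_log` are hypothetical, the sample lines are copied from the console excerpt above, and the flag decoding assumes the Intel architectural layout of the upper bits of the machine check bank status register (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57). Only the "Machine Check Exception" form with a 16-digit status is parsed; other lines are skipped.

```python
import re

# Matches fatal lines such as:
#   mce: [Hardware Error]: CPU 8: Machine Check Exception: 5 Bank 7: fe000b8000010091
MCE_RE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check Exception: "
    r"(?P<mcg>[0-9a-f]+) Bank (?P<bank>\d+): (?P<status>[0-9a-f]{16})\b"
)
PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>.+)")

# Assumption: Intel architectural flag bits in the bank status register.
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status_hex):
    """Return the names of the flag bits set in a 64-bit status value."""
    value = int(status_hex, 16)
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (value >> bit) & 1
    ]


def summarize_console_log(text):
    """Collect machine check exceptions and kernel panic reasons from log text."""
    events, panics = [], []
    for line in text.splitlines():
        mce = MCE_RE.search(line)
        if mce:
            events.append({
                "cpu": int(mce.group("cpu")),
                "bank": int(mce.group("bank")),
                "status": mce.group("status"),
                "flags": decode_status(mce.group("status")),
            })
            continue
        panic = PANIC_RE.search(line)
        if panic:
            panics.append(panic.group("reason").strip())
    return events, panics


# Sample lines copied from the console excerpt in this article.
SAMPLE_LOG = """\
[   14.187163] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 7: fe00004400010091
[   19.396268] Kernel panic - not syncing: Panicing machine check CPU died
SG login: [  477.304189] mce: [Hardware Error]: CPU 8: Machine Check Exception: 5 Bank 7: fe000b8000010091
[  477.346397] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 7: fe000bc000010091
[  477.399399] Kernel panic - not syncing: Fatal machine check
"""

if __name__ == "__main__":
    events, panics = summarize_console_log(SAMPLE_LOG)
    for event in events:
        print(event["cpu"], event["bank"], event["status"], ",".join(event["flags"]))
    for reason in panics:
        print("panic:", reason)
```

For the sample lines, every parsed exception is in bank 7 and has the PCC bit set, which is consistent with the "Processor context corrupt" message in the console output. The sketch only summarizes what is already in the log; it does not identify the failing component.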
