StorageGRID Node in Unknown State Due to Machine Check Exception and Kernel Panic
Applies to
- NetApp StorageGRID Appliance
- StorageGRID 11.x
Issue
- A StorageGRID node becomes completely unresponsive and enters an “Unknown” state.
- The node is unreachable over both the Grid and Admin networks, cannot be pinged, and does not respond to remote reboot attempts. Console logs show repeated machine check exceptions and kernel panics.
- The serial console shows events similar to the following:
APIC 0 microcode 7000013
[ 14.128696] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 880000420 0800091
[ 14.136413] mce: [Hardware Error]: TSC 41efd53e80 MISC 4900040804080400 PPIN 5593508c1574f86
[ 14.144898] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1769540988 SOCKET 0 APIC 0 microcode 7000013
[ 14.154177] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 880000420 0800091
[ 14.161893] mce: [Hardware Error]: TSC 41efd5b48c MISC 4908525c125c0400 PPIN 5593508c1574f86
[ 14.170388] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1769540988 SOCKET 0 APIC 0 microcode 7000013
[ 14.179696] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: c80000820 0800091
[ 14.179696] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: c80000820 0800091
[ 14.187163] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 7: fe00004400010091
[ 19.396268] Kernel panic - not syncing: Panicing machine check CPU died
[ 20.430312] Shutting down cpus with NMI
[ 20.430321] Kernel Offset: 0x6800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xfffffffffffbffff)
------------------------------------------------------------
SG login: [ 477.304189] mce: [Hardware Error]: CPU 8: Machine Check Exception: 5 Bank 7: fe000b8000010091
[ 477.312677] mce: [Hardware Error]: RIP !INEXACT! 33:<00007f118daf065d>
[ 477.319258] mce: [Hardware Error]: TSC 16d447eeff9 ADDR 3301474c0 MISC 4022a286 PPIN 5593508c1574f86
[ 477.328434] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1769537008 SOCKET 0 APIC 1 microcode 7000013
[ 477.337692] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 477.346397] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 7: fe000bc000010091
[ 477.354878] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff893e3d08> {mwait_idle_with_hints.constprop.0+0x48/0x90}
[ 477.365352] mce: [Hardware Error]: TSC 16d447ef69c ADDR 3301474c0 MISC 4022a286 PPIN 5593508c1574f86
[ 477.374527] mce: [Hardware Error]: PROCESSOR 0:50663 TIME 1769537008 SOCKET 0 APIC 0 microcode 7000013
[ 477.383785] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 477.392475] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 477.399399] Kernel panic - not syncing: Fatal machine check
- Lumberjack logs cannot be collected for the affected node. The attempt reports "Error": "Unable to connect to ABC-NODE on port 22: No route to host - connect(2) for x.x.x.x"
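When a long serial-console capture has to be triaged, it can help to tally which CPUs and machine-check banks are reported and which panic messages follow. The following is a minimal sketch only, not a NetApp or StorageGRID tool. The names `summarize_mce`, `MCE_RE`, and `PANIC_RE` are hypothetical, and the patterns assume the console text follows the `mce: [Hardware Error]` format shown in the console output above. It counts events and does not decode the bank status values, which is what `mcelog --ascii` is for.

```python
import re
from collections import Counter

# Assumption: summary lines look like the ones in the console capture, e.g.
#   "mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 7: fe00..."
#   "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8800..."
MCE_RE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): "
    r"Machine Check(?P<exc> Exception)?: [0-9a-fA-F]+ Bank (?P<bank>\d+):"
)

# Captures the panic reason up to the next "[" timestamp or the end of the line.
PANIC_RE = re.compile(r"Kernel panic - not syncing: ([^\[\n]+)")


def summarize_mce(console_text):
    """Return (Counter keyed by (cpu, bank, kind), list of panic messages).

    kind is "exception" for "Machine Check Exception" lines and
    "check" for plain "Machine Check" lines.
    """
    events = Counter()
    for match in MCE_RE.finditer(console_text):
        kind = "exception" if match.group("exc") else "check"
        key = (int(match.group("cpu")), int(match.group("bank")), kind)
        events[key] += 1
    panics = [msg.strip() for msg in PANIC_RE.findall(console_text)]
    return events, panics


if __name__ == "__main__":
    import sys

    # Usage: python summarize_mce.py < console_capture.txt
    events, panics = summarize_mce(sys.stdin.read())
    for (cpu, bank, kind), count in sorted(events.items()):
        print(f"CPU {cpu} bank {bank} {kind}: {count}")
    for message in panics:
        print(f"panic: {message}")
```

Because the patterns use `finditer` on the whole text, the sketch works whether the capture has one entry per line or several entries run together on a single line.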
