SG1100 Panic: CPUs Not Responding to MCE Broadcast

Category: storagegrid-webscale

Specialty: sgrid

Applies to

NetApp StorageGRID Admin Node
SG1100

Issue

The Admin Node (SG1100) reboots unexpectedly and is temporarily unreachable.

The BMC logs a CPU Catastrophic Error (CATERR):

331 Mar/13/2026 01:37:41 [Information] [Host Res Warning] [OEM] Host Partition Reset triggered 255 minutes - Asserted
330 Mar/13/2026 01:36:37 [Critical] [CATERR] [Processor] IERR - Asserted
329 Mar/13/2026 01:35:10 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted

The storagegrid_crash_dmesg.log file shows that the kernel panicked because some CPUs did not respond to the machine check exception (MCE) broadcast:

[5048608.845286] watchdog: BUG: soft lockup - CPU#75 stuck for 78s! [prometheus-node:46612]
...
[5048616.006133] mce: CPUs not responding to MCE broadcast (may include false positives): 10,58
[5048616.006138] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
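
To check whether a saved crash log matches this signature, the three messages above can be extracted with a short script. The following is a minimal sketch, not a NetApp tool: the function names and the SAMPLE text are illustrative assumptions, the regular expressions are derived only from the excerpt shown in this article, and the range handling (for example 3-5) assumes the kernel may print consecutive CPUs as a range.

```python
import re

# Patterns derived from the log excerpt in this article (assumption: other
# kernel versions may word these messages differently).
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)
MCE_TIMEOUT = re.compile(
    r"mce: CPUs not responding to MCE broadcast[^:]*:\s*([\d,\- ]+)"
)
PANIC = re.compile(r"Kernel panic - not syncing: (.+)")

# Sample lines copied from the excerpt above.
SAMPLE = """\
[5048608.845286] watchdog: BUG: soft lockup - CPU#75 stuck for 78s! [prometheus-node:46612]
[5048616.006133] mce: CPUs not responding to MCE broadcast (may include false positives): 10,58
[5048616.006138] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
"""


def expand_cpu_list(text):
    """Expand a CPU list such as '10,58' or '3-5,9' into sorted integers."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def summarize_mce_panic(lines):
    """Collect soft lockups, unresponsive CPUs, and the panic reason."""
    summary = {"soft_lockups": [], "unresponsive_cpus": [], "panic": None}
    for line in lines:
        if (m := SOFT_LOCKUP.search(line)):
            # (CPU number, seconds stuck, task name)
            summary["soft_lockups"].append(
                (int(m.group(1)), int(m.group(2)), m.group(3))
            )
        elif (m := MCE_TIMEOUT.search(line)):
            summary["unresponsive_cpus"] = expand_cpu_list(m.group(1))
        elif (m := PANIC.search(line)):
            summary["panic"] = m.group(1).strip()
    return summary


if __name__ == "__main__":
    print(summarize_mce_panic(SAMPLE.splitlines()))
```

To scan a real file, pass its lines instead of SAMPLE, for example summarize_mce_panic(open("storagegrid_crash_dmesg.log")). A result with a non-empty unresponsive_cpus list and a panic reason mentioning the broadcast exception handler corresponds to the issue described here.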