Appliance compute controller needs attention on StorageGRID due to MCERR
Applies to
Issue
The Appliance compute controller needs attention alert is detected.
storagegrid_crash_dmesg.log indicates a hardware error:
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]: Error 0, type: corrected
{1}[Hardware Error]: section_type: memory error
{1}[Hardware Error]: physical_address: 0x000000267a8e0340
{1}[Hardware Error]: physical_address_mask: 0x00003fffffffffc0
{1}[Hardware Error]: node: 3 card: 2 module: 0 rank: 1 device: 5 row: 40839 column: 176
{1}[Hardware Error]: error_type: 2, single-bit ECC
{1}[Hardware Error]: DIMM location: NODE 2 CPU1_F0
mce: [Hardware Error]: Machine check events logged
EDAC skx MC3: HANDLING MCE MEMORY ERROR
EDAC skx MC3: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
EDAC skx MC3: TSC 0
EDAC skx MC3: ADDR 267a8e0340
EDAC skx MC3: MISC 0
EDAC skx MC3: PROCESSOR 0:50657 TIME 1655421018 SOCKET 0 APIC 0
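When triaging, it can help to pull the key fields out of the dmesg excerpt above (severity, error type, DIMM location) and to check the flag bits of the logged machine check status word. The following is a minimal Python sketch, not a supported tool: the helper names `summarize_dmesg` and `decode_status` are hypothetical, the sample is a subset of the lines shown above, and the bit positions assume the architectural IA32_MCi_STATUS layout (VAL = bit 63, UC = bit 61, ADDRV = bit 58, MCACOD = bits 15:0). Treat the decode as illustrative, since the record here is reported under Bank 255 rather than a numbered hardware bank.

```python
import re

# Subset of the storagegrid_crash_dmesg.log lines shown above.
DMESG_SAMPLE = """\
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]: section_type: memory error
{1}[Hardware Error]: physical_address: 0x000000267a8e0340
{1}[Hardware Error]: error_type: 2, single-bit ECC
{1}[Hardware Error]: DIMM location: NODE 2 CPU1_F0
EDAC skx MC3: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
"""

# "key: value" pairs on APEI/GHES "[Hardware Error]" lines.
GHES_LINE = re.compile(r"^\{\d+\}\[Hardware Error\]:\s*([^:]+):\s*(.+)$")
# Bank number and 64-bit status word on the EDAC "Machine Check Event" line.
MCE_STATUS = re.compile(r"Machine Check Event: \d+ Bank (\d+): ([0-9a-fA-F]{16})")


def summarize_dmesg(text):
    """Hypothetical helper: collect the reported fields into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        m = GHES_LINE.match(line)
        if m:
            fields[m.group(1).strip()] = m.group(2).strip()
        m = MCE_STATUS.search(line)
        if m:
            fields["bank"] = int(m.group(1))
            fields["status"] = int(m.group(2), 16)
    return fields


def decode_status(status):
    """Hypothetical helper: extract a few flags from the status word.

    Assumes the architectural IA32_MCi_STATUS bit layout.
    """
    return {
        "valid": bool((status >> 63) & 1),          # VAL
        "uncorrected": bool((status >> 61) & 1),    # UC
        "address_valid": bool((status >> 58) & 1),  # ADDRV
        "mcacod": status & 0xFFFF,                  # MCA error code
    }


if __name__ == "__main__":
    info = summarize_dmesg(DMESG_SAMPLE)
    print(info["event severity"], "|", info["DIMM location"])
    print(decode_status(info["status"]))
```

For the sample above, the UC flag is clear, which agrees with the "corrected" severity and the single-bit ECC error type reported for the DIMM at NODE 2 CPU1_F0.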
SEL_TEXT.txt, gathered from the BMC under Logs & Reports > IPMI Event Log > Text File Type > Download Event Logs, indicates an Uncorrectable Error:
Severity Sensor Name Sensor Type Description
[Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
[Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
[Warning] [Additional MCE Error] [OEM Record C2] ManufacturerID:001C4C, Extra Information : 0 MSCOD:0010 MCACOD:0134
[Critical] [MCERR] [Processor] Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 1 - Asserted
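To pick the critical entries out of a longer SEL_TEXT.txt, the bracketed columns can be split with a small script. This is a rough Python sketch that assumes every event line follows the `[Severity] [Sensor Name] [Sensor Type] Description` shape shown above; `parse_sel` and `critical_events` are hypothetical helper names, and the sample is copied from the excerpt.

```python
import re

# Lines copied from the SEL_TEXT.txt excerpt above.
SEL_SAMPLE = """\
[Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
[Warning] [Additional MCE Error] [OEM Record C2] ManufacturerID:001C4C, Extra Information : 0 MSCOD:0010 MCACOD:0134
[Critical] [MCERR] [Processor] Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 1 - Asserted
"""

# Three bracketed columns followed by a free-text description.
SEL_LINE = re.compile(r"^\[([^\]]+)\]\s+\[([^\]]+)\]\s+\[([^\]]+)\]\s+(.*)$")


def parse_sel(text):
    """Hypothetical helper: turn each event line into a dict."""
    events = []
    for line in text.splitlines():
        m = SEL_LINE.match(line.strip())
        if not m:
            continue  # skip the header row and anything unrecognized
        severity, sensor_name, sensor_type, description = m.groups()
        events.append({
            "severity": severity,
            "sensor_name": sensor_name,
            "sensor_type": sensor_type,
            "description": description,
        })
    return events


def critical_events(events):
    """Keep only the entries logged with Critical severity."""
    return [e for e in events if e["severity"] == "Critical"]


if __name__ == "__main__":
    for event in critical_events(parse_sel(SEL_SAMPLE)):
        print(event["sensor_name"], "->", event["description"])
```

Filtering on severity keeps the CATERR and MCERR entries together, so the bank, CPU, and core named in the Uncorrectable Error line are easy to spot.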