Correctable ECC memory error causes driveAvailable
Applies to
- SF9605
- NetApp Element software
Issue
- A nodeOffline alert is raised and resolves within a minute.
- driveAvailable alerts are raised for multiple drives.
- kern.log reports the following error:
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]: Error 0, type: corrected
{1}[Hardware Error]: fru_text: B1
{1}[Hardware Error]: section_type: memory error
{1}[Hardware Error]: error_status: 0x0000000000000400
{1}[Hardware Error]: physical_address: 0x0000003ff8638200
{1}[Hardware Error]: node: 1 card: 0 module: 0 rank: 1 bank: 3 row: 65281 column: 532
{1}[Hardware Error]: error_type: 2, single-bit ECC
mce: [Hardware Error]: Machine check events logged
EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
EDAC sbridge MC1: TSC 0
EDAC sbridge MC1: ADDR 3ff8638200
EDAC sbridge MC1: MISC 0
EDAC sbridge MC1: PROCESSOR 0:306f2 TIME 1679839488 SOCKET 0 APIC 0
EDAC MC0: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 or CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 or CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 (slot:0 page:0x3ff8638 offset:0x200 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:0 channel_mask:3 rank:1)
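
To check whether a kern.log excerpt matches this signature, the fields can be pulled out with a short script. The following is a minimal sketch, not a NetApp tool: the function names (`parse_ghes_fields`, `is_corrected_memory_error`, `edac_physical_address`) are hypothetical, and the address reconstruction assumes 4 KiB pages, which is consistent with the page and offset values in the excerpt above.

```python
import re

# Matches the payload of "{N}[Hardware Error]: ..." lines (APEI/GHES records).
GHES_LINE = re.compile(r"\{\d+\}\[Hardware Error\]:\s*(.*)")
# Matches the page/offset pair in the EDAC "CE memory read error" line.
EDAC_PAGE_OFFSET = re.compile(r"page:(0x[0-9a-fA-F]+) offset:(0x[0-9a-fA-F]+)")

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ghes_fields(lines):
    """Collect 'key: value' pairs from {N}[Hardware Error] lines."""
    fields = {}
    for line in lines:
        match = GHES_LINE.search(line)
        if not match:
            continue
        key, sep, value = match.group(1).partition(": ")
        if sep:  # skip free-text lines that carry no 'key: value' pair
            fields[key.strip()] = value.strip()
    return fields


def is_corrected_memory_error(fields):
    """True when the record is a memory error that hardware corrected."""
    return (
        fields.get("event severity") == "corrected"
        and fields.get("section_type") == "memory error"
    )


def edac_physical_address(line):
    """Rebuild the physical address from the EDAC page and offset, or None."""
    match = EDAC_PAGE_OFFSET.search(line)
    if not match:
        return None
    page, offset = (int(group, 16) for group in match.groups())
    return (page << PAGE_SHIFT) | offset


# Lines copied from the kern.log excerpt above (EDAC line shortened).
SAMPLE = [
    "{1}[Hardware Error]: It has been corrected by h/w and requires no further action",
    "{1}[Hardware Error]: event severity: corrected",
    "{1}[Hardware Error]: fru_text: B1",
    "{1}[Hardware Error]: section_type: memory error",
    "{1}[Hardware Error]: physical_address: 0x0000003ff8638200",
    "{1}[Hardware Error]: error_type: 2, single-bit ECC",
    "EDAC MC0: 0 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 "
    "(slot:0 page:0x3ff8638 offset:0x200 grain:32 syndrome:0x0)",
]

fields = parse_ghes_fields(SAMPLE)
ghes_address = int(fields["physical_address"], 16)
edac_address = edac_physical_address(SAMPLE[-1])

print("corrected memory error:", is_corrected_memory_error(fields))
print("FRU:", fields["fru_text"])
print("addresses agree:", ghes_address == edac_address)
```

For the excerpt above, the APEI record and the EDAC line resolve to the same physical address, 0x3ff8638200, which indicates that both messages describe the same single-bit ECC event on the DIMM labeled B1.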