Node Down Due to Uncorrectable ECC Memory Error
Applies to
- ONTAP 9
- AFF/FAS
Issue
- One of the nodes in the cluster unexpectedly went down and was taken over by its partner node.
- Event logs reported an uncorrectable ECC memory error at DIMM-8 and an Uncorrectable Machine Check Error on CPU 38, which triggered a hardware-assisted takeover:
ECC error at DIMM-8: 2C-0F-1949-254B5241, ADDR 0x5484e80b00
Uncorrectable Machine Check Error at CPU 38. SKL_IMC1 Error: STATUS(VALID,UC,EN,MISCV,ADDRV,PCC,CORR_ERR_STATUS(0),CORR_ERR_CNT(0),MSCOD(0x101),MCCOD(0x90))
...
hw_assist: Received takeover hw_assist alert from partner (cnanec01afpd02-n05), system_down because power_cycle_via_sp.
- Node came back online and was waiting for giveback.
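
When triaging several events of this kind, it can help to pull the key fields out of the log text programmatically. The following is a minimal Python sketch, not a NetApp tool: the `summarize` function, the regular expressions, and the dictionary keys are hypothetical names chosen for illustration, and the patterns assume the message wording shown in the excerpt above. The string after the DIMM slot is stored as `dimm_id` without interpreting it, since its exact meaning is not stated here.

```python
import re

# Sample lines copied from the event log excerpt in this article.
LOG_LINES = [
    "ECC error at DIMM-8: 2C-0F-1949-254B5241, ADDR 0x5484e80b00",
    "Uncorrectable Machine Check Error at CPU 38. SKL_IMC1 Error: "
    "STATUS(VALID,UC,EN,MISCV,ADDRV,PCC,CORR_ERR_STATUS(0),"
    "CORR_ERR_CNT(0),MSCOD(0x101),MCCOD(0x90))",
    "hw_assist: Received takeover hw_assist alert from partner "
    "(cnanec01afpd02-n05), system_down because power_cycle_via_sp.",
]

# Hypothetical patterns; they assume the exact wording shown above.
ECC_RE = re.compile(r"ECC error at (DIMM-\d+): ([^,\s]+), ADDR (0x[0-9a-fA-F]+)")
MCE_RE = re.compile(r"Uncorrectable Machine Check Error at CPU (\d+)")
HWA_RE = re.compile(r"hw_assist alert from partner \(([^)]+)\), (\w+) because (\w+)")


def summarize(lines):
    """Collect DIMM, CPU, and takeover details from matching log lines."""
    summary = {}
    for line in lines:
        ecc = ECC_RE.search(line)
        if ecc:
            summary["dimm"] = ecc.group(1)
            summary["dimm_id"] = ecc.group(2)  # kept as-is, not interpreted
            summary["address"] = ecc.group(3)
        mce = MCE_RE.search(line)
        if mce:
            summary["cpu"] = int(mce.group(1))
        hwa = HWA_RE.search(line)
        if hwa:
            summary["partner"] = hwa.group(1)
            summary["event"] = hwa.group(2)
            summary["reason"] = hwa.group(3)
    return summary


if __name__ == "__main__":
    for key, value in summarize(LOG_LINES).items():
        print(f"{key}: {value}")
```

Lines that match none of the patterns are ignored, so the same function can be run over a longer log export as long as the messages keep this wording.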
