Node Down Due to Uncorrectable ECC Memory Error

Category: ontap-9

Specialty: HW

Applies to

ONTAP 9
AFF/FAS

Issue

One node in the cluster went down unexpectedly and was taken over by its partner node.
The event logs reported an uncorrectable ECC memory error at DIMM-8 and an uncorrectable machine check error on CPU 38, which triggered a hardware-assisted takeover.

ECC error at DIMM-8: 2C-0F-1949-254B5241, ADDR 0x5484e80b00
Uncorrectable Machine Check Error at CPU 38. SKL_IMC1 Error: STATUS(VALID,UC,EN,MISCV,ADDRV,PCC,CORR_ERR_STATUS(0),CORR_ERR_CNT(0),MSCOD(0x101),MCCOD(0x90))...
hw_assist: Received takeover hw_assist alert from partner (cnanec01afpd02-n05), system_down because power_cycle_via_sp.
The node then came back online and was waiting for giveback.
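
When triaging many such events, it can help to pull the key fields out of the log text programmatically. The following is a minimal sketch, not a NetApp tool: the function name `parse_ecc_event`, the dictionary keys, and the regular expressions are assumptions based only on the sample messages shown in this article, and other ONTAP releases or platforms may format these messages differently.

```python
import re

# Patterns are assumptions derived from the sample log text in this article.
ECC_RE = re.compile(r"ECC error at (DIMM-\d+): (\S+), ADDR (0x[0-9a-fA-F]+)")
MCE_RE = re.compile(r"Uncorrectable Machine Check Error at CPU (\d+)")
CODE_RE = re.compile(r"\b(MSCOD|MCCOD)\((0x[0-9a-fA-F]+)\)")
PARTNER_RE = re.compile(r"hw_assist alert from partner \(([^)]+)\)")


def parse_ecc_event(text):
    """Extract DIMM, address, CPU, and status codes from event log text.

    Hypothetical helper: returns a dict containing only the fields found.
    """
    event = {}

    ecc = ECC_RE.search(text)
    if ecc:
        event["dimm"] = ecc.group(1)
        # The meaning of this identifier string is not stated in the log.
        event["dimm_id"] = ecc.group(2)
        event["address"] = int(ecc.group(3), 16)

    mce = MCE_RE.search(text)
    if mce:
        event["cpu"] = int(mce.group(1))

    # Collect the MSCOD and MCCOD values from the STATUS(...) field list.
    for name, value in CODE_RE.findall(text):
        event[name.lower()] = int(value, 16)

    partner = PARTNER_RE.search(text)
    if partner:
        event["partner"] = partner.group(1)

    return event


if __name__ == "__main__":
    sample = (
        "ECC error at DIMM-8: 2C-0F-1949-254B5241, ADDR 0x5484e80b00\n"
        "Uncorrectable Machine Check Error at CPU 38. SKL_IMC1 Error: "
        "STATUS(VALID,UC,EN,MISCV,ADDRV,PCC,CORR_ERR_STATUS(0),"
        "CORR_ERR_CNT(0),MSCOD(0x101),MCCOD(0x90))\n"
        "hw_assist: Received takeover hw_assist alert from partner "
        "(cnanec01afpd02-n05), system_down because power_cycle_via_sp."
    )
    for key, value in parse_ecc_event(sample).items():
        print(f"{key}: {value}")
```

The function returns only the fields it finds, so it can be run against partial excerpts without raising errors. The extracted DIMM slot and CPU number are the values most useful to have at hand when comparing repeated events on the same node.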