Node panics due to ECC error caused by a faulty DIMM


Category: ontap-9

Specialty: hw

Applies to

ONTAP 9
FAS systems
AFF systems

Issue

A node reboots unexpectedly, and the following cluster alerts are reported:

HA Group Notification from Node13 (NODE(S) OUT OF CLUSTER QUORUM) EMERGENCY
HA Group Notification (PARTNER REBOOT (CONTROLLER TAKEOVER)) NOTICE

The EMS logs show the following:

ECC error at DIMM-7: 2C-02-1909-20F18D8D,ADDR 0x208455e900,(Node(0), Memory controller(1), CH(3), DIMM(0), Rank(1), Bank Group(0), Bank(0x1), Row(0x10045), Col(0x1d0)) SKL_IMC1 Error:

Fri Dec 20 16:26:31 2024 SRAM record type(CPU) from Data ONTAP: socket(0) core(4) bank(8)
Fri Dec 20 16:26:31 2024 SRAM record type(LOG) from Data ONTAP: UECC Addr 0x208455e900
Fri Dec 20 16:26:31 2024 SRAM record type(DIMM) from Data ONTAP: slot(7)

In some cases, the node may fail to boot with the following panic string:

PANIC: ECC error at DIMM-2: CE-03-2040-176B3357,ADDR 0x558b31e40,(Node(0), Memory controller(0), CH(1), DIMM(0), Rank(0), Bank Group(3), Bank(0x3), Row(0x9633), Col(0xf8)) Uncorrectable Machine Check Error at CPU9. BDWL_HA0 Error: STATUS<0xbe00000000010091>(Val,UnCor,Enable,MiscV,AddrV,PCC,CorrSts(0),CorrCnt(0),ExtErr(0x1),ErrCode(Channel 1, Read)ErrCode(0x91))MISC<0x000000044056d686>(HaDbBank(0),PE(0),ReqOpcode(0x22),RNID(0),RTID(0x2b),HTID(0x6b))ADDR<0x0000000558b31e40>((0x558b31e40)).  in process idle: cpu9 on release 9.7P10 (C) on Sun Nov 13 00:57:56 IST 2022
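The ECC error messages above share a common layout, so the affected DIMM slot can be pulled out of a saved log with a short script. The following is a minimal sketch, not a NetApp tool: the function name `parse_ecc_error`, the field names, and the regular expression are assumptions modeled only on the two sample messages in this article, and other releases or platforms may format the line differently.

```python
import re

# Pattern modeled on the two sample messages in this article (assumption);
# it is not an official description of the message format.
ECC_RE = re.compile(
    r"ECC error at DIMM-(?P<slot>\d+): (?P<dimm_id>[0-9A-Fa-f-]+),"
    r"ADDR (?P<addr>0x[0-9a-fA-F]+),"
    r"\(Node\((?P<node>\d+)\), Memory controller\((?P<imc>\d+)\), "
    r"CH\((?P<channel>\d+)\), DIMM\((?P<dimm>\d+)\), Rank\((?P<rank>\d+)\)"
)


def parse_ecc_error(line):
    """Return a dict of fields from an ECC error line, or None if no match."""
    match = ECC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the decimal fields to integers for easier comparison.
    for key in ("slot", "node", "imc", "channel", "dimm", "rank"):
        fields[key] = int(fields[key])
    # The address is printed in hexadecimal.
    fields["addr"] = int(fields["addr"], 16)
    return fields


ecc_sample = (
    "ECC error at DIMM-7: 2C-02-1909-20F18D8D,ADDR 0x208455e900,"
    "(Node(0), Memory controller(1), CH(3), DIMM(0), Rank(1), "
    "Bank Group(0), Bank(0x1), Row(0x10045), Col(0x1d0)) SKL_IMC1 Error:"
)
info = parse_ecc_error(ecc_sample)
print(info["slot"], info["dimm_id"], hex(info["addr"]))
```

The same function also matches the `PANIC:` form of the message, because it searches for the `ECC error at DIMM-` text anywhere in the line.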

BMC events also report a DIMM trap:

Record 1382: Tue Oct 21 10:00:02.423402 2025 [IPMI Event.critical]: DIMM UECC Fatal Error detected by Storage OS
Record 1383: Tue Oct 21 10:00:02.463052 2025 [Trap Event.critical]: hwassist dimm_uecc_error (32)
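When BMC event records are copied out of a console, several records can end up on one line. The sketch below splits such text into individual records and picks out the DIMM-related ones. It is illustrative only: the function name `parse_bmc_records`, the field names, and the pattern are assumptions based on the two sample records in this article, not a documented log format.

```python
import re

# Pattern modeled on the two sample records in this article (assumption).
# A message ends at the next " Record <n>: " marker or at the end of a line.
RECORD_RE = re.compile(
    r"Record (?P<num>\d+): "
    r"(?P<timestamp>\w{3} \w{3} +\d+ [\d:.]+ \d{4}) "
    r"\[(?P<source>[^\]]+?)\.(?P<severity>\w+)\]: "
    r"(?P<message>.*?)(?= Record \d+: |$)",
    re.MULTILINE,
)


def parse_bmc_records(text):
    """Return a list of dicts, one per BMC event record found in text."""
    records = []
    for match in RECORD_RE.finditer(text):
        fields = match.groupdict()
        fields["num"] = int(fields["num"])
        records.append(fields)
    return records


bmc_sample = (
    "Record 1382: Tue Oct 21 10:00:02.423402 2025 [IPMI Event.critical]: "
    "DIMM UECC Fatal Error detected by Storage OS "
    "Record 1383: Tue Oct 21 10:00:02.463052 2025 [Trap Event.critical]: "
    "hwassist dimm_uecc_error (32)"
)
records = parse_bmc_records(bmc_sample)

# Keep only critical events whose message mentions a DIMM.
for rec in records:
    if rec["severity"] == "critical" and "dimm" in rec["message"].lower():
        print(rec["num"], rec["source"], rec["message"])
```

Matching on the `Record <n>:` marker rather than on line breaks lets the same function handle records on separate lines and records that run together on one line.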