Node panics due to ECC error caused by a faulty DIMM
Applies to
- ONTAP 9
- FAS systems
- AFF systems
Issue
- The node reboots unexpectedly after the following cluster alerts:
HA Group Notification from Node13 (NODE(S) OUT OF CLUSTER QUORUM) EMERGENCY
HA Group Notification (PARTNER REBOOT (CONTROLLER TAKEOVER)) NOTICE
The EMS log shows the following:
ECC error at DIMM-7: 2C-02-1909-20F18D8D,ADDR 0x208455e900,(Node(0), Memory controller(1), CH(3), DIMM(0), Rank(1), Bank Group(0), Bank(0x1), Row(0x10045), Col(0x1d0)) SKL_IMC1 Error:
Fri Dec 20 16:26:31 2024 SRAM record type(CPU) from Data ONTAP: socket(0) core(4) bank(8)
Fri Dec 20 16:26:31 2024 SRAM record type(LOG) from Data ONTAP: UECC Addr 0x208455e900
Fri Dec 20 16:26:31 2024 SRAM record type(DIMM) from Data ONTAP: slot(7)
- In some cases, the node may fail to boot with the following panic string:
PANIC: ECC error at DIMM-2: CE-03-2040-176B3357,ADDR 0x558b31e40,(Node(0), Memory controller(0), CH(1), DIMM(0), Rank(0), Bank Group(3), Bank(0x3), Row(0x9633), Col(0xf8)) Uncorrectable Machine Check Error at CPU9. BDWL_HA0 Error: STATUS<0xbe00000000010091>(Val,UnCor,Enable,MiscV,AddrV,PCC,CorrSts(0),CorrCnt(0),ExtErr(0x1),ErrCode(Channel 1, Read)ErrCode(0x91))MISC<0x000000044056d686>(HaDbBank(0),PE(0),ReqOpcode(0x22),RNID(0),RTID(0x2b),HTID(0x6b))ADDR<0x0000000558b31e40>((0x558b31e40)). in process idle: cpu9 on release 9.7P10 (C) on Sun Nov 13 00:57:56 IST 2022
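When reviewing many log lines, it can help to pull the DIMM slot number, the identifier string that follows it, and the failing address out of each ECC message. The following is a minimal sketch only, not a NetApp tool: the regular expression is inferred solely from the two sample messages in this article, and the names `ECC_RE`, `parse_ecc_error`, and `dimm_id` are hypothetical. Other ONTAP releases or platforms may format the message differently.

```python
import re

# Pattern inferred from the two sample messages in this article only;
# it is an assumption, not a documented ONTAP message format.
ECC_RE = re.compile(
    r"ECC error at DIMM-(?P<slot>\d+):\s*"   # DIMM slot number
    r"(?P<dimm_id>[0-9A-Fa-f-]+),\s*"        # identifier string after the slot
    r"ADDR\s+(?P<addr>0x[0-9A-Fa-f]+)"       # failing physical address
)


def parse_ecc_error(message):
    """Return slot, identifier, and address from an ECC error line, or None."""
    match = ECC_RE.search(message)
    if match is None:
        return None
    return {
        "slot": int(match.group("slot")),
        "dimm_id": match.group("dimm_id"),
        "addr": int(match.group("addr"), 16),
    }


if __name__ == "__main__":
    sample = ("ECC error at DIMM-7: 2C-02-1909-20F18D8D,"
              "ADDR 0x208455e900,(Node(0), Memory controller(1), CH(3))")
    print(parse_ecc_error(sample))
```

The function returns `None` for lines that do not match, so it can be applied to every line of a saved log and the non-matching lines skipped.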