Node performs automatic takeover/giveback due to DIMM UECC error
Applies to
- ONTAP 9
- FAS/AFF systems
Issue
- Node1 is taken over with the following alert, and an automatic giveback follows:
System Alert from SP of node1 (APPLIANCE_ASUP_DIMM_UECC_ERROR) CRITICAL
- The system log shows that a system power cycle is triggered after a UECC is detected against DIMM-X:
ECC error at DIMM-X: CE-01-2246-03EF3FEB,ADDR 0x2a762b080,(Node(0), Memory controller(0), CH(1), DIMM(0), Rank(0), Bank Group(1), Bank(0x1), Row(0x49ca), Col(0x380)), devtag(0x3f), correrr(0x0) Uncorrectable Machine Check Error at CPU9. BDWL_HA0 Error: STATUS<0xbe00000000010091>(Val,UnCor,Enable,MiscV,AddrV,PCC,CorrSts(0),CorrCnt(0),ExtErr(0x1),ErrCode(Channel 1, Read),ErrCode(0x91)),MISC<0x0000000150020286>(HaDbBank(0),PE(0),ReqOpcode(0xa),RNID(0),RTID(0x1),HTID(0x1))
Requesting SP to power cycle the filer to attempt to clear DRAM UECC
- The following entries are found in sp_system_event_log:
Record697: Sat Jun 22 16:47:56.115990 2024 [IPMI Event.critical]: DIMM UECC Fatal Error detected by Storage OS
Record698: Sat Jun 22 16:47:56.127985 2024 [Trap Event.critical]: hwassistdimm_uecc_error (32)
Record699: Sat Jun 22 16:47:56.164909 2024 [Trap Event.critical]: SNMPdimm_uecc_error (32)
Record700: Sat Jun 22 16:47:56.468211 2024 [IPMI Event.critical]: System power cycle
- After giveback, the status of DIMM-X in DIMM-INFO.XML shows ok:
DIMM ID   Slot Name   Status
2         DIMM-X      ok
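
To confirm the same sequence in another SP event log, the records can be checked with a short script. The following is a minimal Python sketch, not a NetApp tool. It assumes the records follow the same layout as the sample entries above (record number, timestamp, [source.severity], message) and that the log text has already been extracted to a string. The function names and the 5-second window are hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime

# Record layout inferred only from the sample entries in this article.
RECORD_RE = re.compile(
    r"Record\s*(?P<num>\d+):\s+"
    r"(?P<ts>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}\.\d+ \d{4})\s+"
    r"\[(?P<source>[^\]]+?)\.(?P<severity>\w+)\]:\s+"
    r"(?P<message>.*)"
)


def parse_records(text):
    """Return one dict per SP event record found in the text."""
    records = []
    for line in text.splitlines():
        m = RECORD_RE.search(line)
        if m is None:
            continue  # skip lines that do not look like a record
        records.append({
            "num": int(m.group("num")),
            # %a and %b assume English day and month names (C locale).
            "time": datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S.%f %Y"),
            "source": m.group("source"),
            "severity": m.group("severity"),
            "message": m.group("message").strip(),
        })
    return records


def find_uecc_power_cycle(records, window_seconds=5.0):
    """Return (first UECC record, power cycle record) or None.

    window_seconds is an illustrative threshold, not a documented value.
    """
    first_uecc = None
    for rec in records:
        msg = rec["message"].lower()
        if first_uecc is None:
            if "uecc" in msg:
                first_uecc = rec
        elif "power cycle" in msg:
            delta = (rec["time"] - first_uecc["time"]).total_seconds()
            if 0 <= delta <= window_seconds:
                return first_uecc, rec
    return None


SAMPLE = """\
Record697: Sat Jun 22 16:47:56.115990 2024 [IPMI Event.critical]: DIMM UECC Fatal Error detected by Storage OS
Record698: Sat Jun 22 16:47:56.127985 2024 [Trap Event.critical]: hwassistdimm_uecc_error (32)
Record699: Sat Jun 22 16:47:56.164909 2024 [Trap Event.critical]: SNMPdimm_uecc_error (32)
Record700: Sat Jun 22 16:47:56.468211 2024 [IPMI Event.critical]: System power cycle
"""

if __name__ == "__main__":
    pair = find_uecc_power_cycle(parse_records(SAMPLE))
    if pair is not None:
        uecc, cycle = pair
        print(f"Record {uecc['num']} (UECC) is followed by record {cycle['num']} (power cycle)")
```

For the sample entries above, the sketch pairs record 697 with record 700, which matches the sequence described in this article: the UECC is reported first and the SP power cycles the system immediately afterward.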