CONTAP-665327: Auto case was not created for DIMM UMCE where node takeover occurred
Issue
- A node takeover occurred following an uncorrectable ECC error detected on a DIMM.
ECC error at DIMM-8: 2C-0F-1949-254B5241,ADDR 0x5484e80b00,(Node(1), Memory controller(1), CH(3), DIMM(0), Rank(0), Bank Group(2), Bank(0x3), Row(0xe066), Col(0x0))
Uncorrectable Machine Check Error at CPU38. SKL_IMC1 Error: STATUS<0xbe00000001010090>(VALID,UC,EN,MISCV,ADDRV,PCC,CORR_ERR_STATUS(0),CORR_ERR_CNT(0),MSCOD(0x101),MCCOD(0x90))MISC<0x20000ac527e02086>(DataErrorChunk(0x2),McCmdChnl(0),McCmdMemRegion(0),McCmdOpcode(0xa),McCmdVld,SmiAD,SmiMsgClass(0),SmiOpcode(0xa),TrkId(0x13f),Error_Type(0x4),ADDRMODE(0x2),ADDRLSB(0x6))ADDR<0x0000005484e80b00>(HIPHYADDR(0x54),LOPHYADDR(0x213a02c))(Node(1), Memory controller(1), CH(0), DIMM(0), Rank(0), Bank Group(2), Bank(0x3), Row(0xe066), Col(0x0), Device(0), DQ-Burst(2,0), DQ-Burst(1,4),
Requesting SP to power cycle the filer to attempt to clear the Machine Check Event
- The Service Processor (SP) initiated a power cycle to recover from the error, which caused a manual takeover by the HA partner node.
cf_hwassist: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(cnanec01afpd02-n05), system_down because power_cycle_via_sp.
- However, no AutoSupport (ASUP) case was automatically generated for this critical hardware failure event.
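
To cross-check the flags printed in the Uncorrectable Machine Check Error line above, the STATUS value can be decoded by hand. The following is a minimal sketch, not an ONTAP tool: it assumes the standard Intel IA32_MCi_STATUS bit layout, and the function name `decode_mci_status` and the flag labels (taken from the log text) are illustrative only.

```python
# Illustrative sketch: decode a machine check STATUS value.
# Assumption: bit positions follow the Intel IA32_MCi_STATUS register layout.
# Flag names mirror the labels printed in the log line (hypothetical helper).
FLAG_BITS = {
    "VALID": 63,   # register contains a valid error
    "OVER": 62,    # a previous error was overwritten
    "UC": 61,      # error was uncorrected
    "EN": 60,      # error reporting was enabled
    "MISCV": 59,   # MISC register holds additional information
    "ADDRV": 58,   # ADDR register holds the error address
    "PCC": 57,     # processor context may be corrupted
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit STATUS value into flags and code fields."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "corr_err_cnt": (status >> 38) & 0x7FFF,  # bits 52:38
        "mscod": (status >> 16) & 0xFFFF,         # bits 31:16, model-specific code
        "mccod": status & 0xFFFF,                 # bits 15:0, architectural code
    }


# STATUS value copied from the log excerpt in this article
decoded = decode_mci_status(0xBE00000001010090)
print(decoded["flags"])
print(hex(decoded["mscod"]), hex(decoded["mccod"]), decoded["corr_err_cnt"])
```

For the value in this article, the decoded fields agree with what the log line itself reports: the UC and PCC flags are set, the corrected error count is 0, MSCOD is 0x101, and MCCOD is 0x90.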
