Uncorrectable memory error on an AFF / FAS system that does not support PPR
Last Updated: 11/29/2024, 8:03:35 AM
Applies to
- ONTAP 9
- Platforms:
- AFF A320
- AFF A300 / FAS8200
- AFF A250 / AFF C250 / FAS500f
- AFF A220 / AFF C190 / FAS27x0
- AFF A200 / FAS26x0
- AFF / FAS80x0
- FAS22x0 / FAS25x0
- FAS32x0 / FAS62x0
Issue
- Controller panics and reboots with a DIMM error:
PANIC: ECC error at DIMM-18: 2C-0F-2007-2664E6BE,ADDR 0x180a048b40,(Node(1), Memory controller(1), CH(3), DIMM(0), Rank(0), Bank Group(1), Bank(0x0), Row(0xb8b1), Col(0x2f8), Uncorrectable Machine Check Error at CPU21.
- EMS log:
[cf_hwassist: cf.hwassist.takeoverTrapRecv:debug]: hw_assist: Received takeover hw_assist alert from partner(node02), system_down because dimm_uecc_error.
- Event log (events all):
ECC error at DIMM-2: 2C-0F-1910-20FE7F16,ADDR 0x27fce6000,(Node(0), Memory controller(0), CH(1), DIMM(0), Rank(0), Bank Group(0), Bank(0x0), Row(0x0), Col(0x0)), devtag(0x3f), correrr(0x0) Uncorrectable Machine Check Error at CPU9. BDWL_HA0 Error: STATUS<0xfe00000000010091>(Val,OverF,UnCor,Enable,MiscV,AddrV,PCC,CorrSts(0),CorrCnt(0),ExtErr(0x1),ErrCode(Channel 1, Read),ErrCode(0x91)),MISC<0x00000000406aea86>(HaDbBank(0),PE(0),ReqOpcode(0x2),RNID(0),RTID(0x35),HTID(0x75))
Requesting SP to power cycle the filer to attempt to clear DRAM UECC
[IPMI Event.critical]: DIMM UECC Fatal Error detected by Storage OS
[Trap Event.critical]: hwassist dimm_uecc_error (32)
[Trap Event.critical]: SNMP dimm_uecc_error (32)
[IPMI Event.critical]: System power cycle
[IPMI.notice]: 08e8 | 02 | EVT: 015000ad | P3V3 | Assertion Event, "Lower Non-critical going low " | Reading: 0.000 | Threshold: 3.027
[IPMI.notice]: 08e9 | 02 | EVT: 015200a9 | P3V3 | Assertion Event, "Lower Critical going low " | Reading: 0.000 | Threshold: 2.957
[IPMI.notice]: 08ea | 02 | EVT: 0300ffff | Power_Good | Assertion Event, "State Deasserted"
[IPMI.notice]: 08eb | 02 | EVT: 015006af | P12V | Assertion Event, "Lower Non-critical going low " | Reading: 0.372 | Threshold: 10.850
[IPMI.notice]: 08ec | 02 | EVT: 015206aa | P12V | Assertion Event, "Lower Critical going low " | Reading: 0.372 | Threshold: 10.540
[BMC.critical]: Filer Reboots
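
When several such messages need to be compared (for example, to check whether repeated panics point at the same DIMM slot), the panic string can be broken into its parts with a short script. The following is a minimal sketch only: `parse_ecc_line` is a hypothetical helper, the field names are copied from the sample messages above rather than from an official message specification, and the hyphenated identifier after the DIMM slot is captured as-is because this article does not define its meaning.

```python
import re

# Assumption: the layout below is inferred from the sample messages in this
# article; other ONTAP releases or platforms may format the line differently.
ECC_RE = re.compile(
    r"ECC error at (?P<dimm>DIMM-\d+): (?P<ident>[0-9A-Fa-f-]+),\s*"
    r"ADDR (?P<addr>0x[0-9A-Fa-f]+)"
)
# "Bank Group" must be tried before "Bank"; \b avoids matching inside
# longer names such as "HaDbBank".
FIELD_RE = re.compile(
    r"\b(Node|Memory controller|CH|DIMM|Rank|Bank Group|Bank|Row|Col)"
    r"\((0x[0-9A-Fa-f]+|\d+)\)"
)
CPU_RE = re.compile(r"Uncorrectable Machine Check Error at CPU(\d+)")


def to_int(text):
    """Convert a decimal or 0x-prefixed hexadecimal string to an int."""
    return int(text, 16) if text.startswith("0x") else int(text)


def parse_ecc_line(line):
    """Return a dict describing the ECC error, or None if the line does not match."""
    match = ECC_RE.search(line)
    if match is None:
        return None
    location = {}
    # Only scan the text after the address so "DIMM-18" itself is not re-read.
    for name, value in FIELD_RE.findall(line[match.end():]):
        location.setdefault(name, to_int(value))  # keep the first occurrence
    cpu = CPU_RE.search(line)
    return {
        "dimm": match.group("dimm"),
        "ident": match.group("ident"),
        "addr": to_int(match.group("addr")),
        "location": location,
        "cpu": int(cpu.group(1)) if cpu else None,
    }


if __name__ == "__main__":
    sample = (
        "PANIC: ECC error at DIMM-18: 2C-0F-2007-2664E6BE,ADDR 0x180a048b40,"
        "(Node(1), Memory controller(1), CH(3), DIMM(0), Rank(0), "
        "Bank Group(1), Bank(0x0), Row(0xb8b1), Col(0x2f8), "
        "Uncorrectable Machine Check Error at CPU21."
    )
    print(parse_ecc_line(sample))
```

Lines that do not contain an ECC error return `None`, so the same function can be applied to every line of a saved console or event log and the non-matching lines skipped.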