AFF A700s CECC: Correctable Machine Check Errors being reported against wrong DIMM
Applies to
- AFF A700s
- ONTAP 9
- ONTAP 9.1P17 and earlier
- ONTAP 9.3P11 and earlier
- ONTAP 9.4P6 and earlier
Issue
The CECC error continues to be reported against the same DIMM even after that DIMM has been replaced:
- The system health alert show command reports errors similar to the following on the cluster:
Node xxxxxx
Monitor controller
Alert ID CriticalCECCCountMemErrAlert
Alerting Resource DIMM-x
Subsystem Memory
Indication Time Tue Oct 09 12:24:36 2018
Perceived Severity Critical
Probable Cause DIMM_Degraded
Description The DIMM has degraded, leading to memory errors.
- The EMS log displays a message similar to the following, reporting a CECC error on the specific DIMM:
[?] Tue Oct 09 12:24:36 IST [xxxx: mgwd: callhome.hm.alert.critical:alert]: Call home for Health Monitor process nphm: CriticalCECCCountMemErrAlert[DIMM-x].
- Normally, replacing this DIMM is recommended.
- However, even after the replacement, the cluster might continue to report errors against the same DIMM.
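To check whether the alert keeps recurring against the same DIMM after a replacement, the EMS messages can be tallied per DIMM. The following is a minimal Python sketch, not a NetApp tool. The regular expression is modeled only on the sample EMS line shown above, and the function name count_cecc_alerts is hypothetical. Real EMS output may be formatted differently.

```python
import re
from collections import Counter

# Assumption: the pattern is derived solely from the sample EMS line in this
# article; adjust it if your EMS output is formatted differently.
ALERT_RE = re.compile(
    r"callhome\.hm\.alert\.critical:alert\]: .*?"
    r"CriticalCECCCountMemErrAlert\[(?P<dimm>DIMM-[^\]]+)\]"
)

def count_cecc_alerts(lines):
    """Return a Counter mapping each DIMM label to its number of CECC alerts."""
    counts = Counter()
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            counts[match.group("dimm")] += 1
    return counts

# Sample input copied from the EMS message in this article.
sample = [
    "[?] Tue Oct 09 12:24:36 IST [xxxx: mgwd: callhome.hm.alert.critical:alert]: "
    "Call home for Health Monitor process nphm: "
    "CriticalCECCCountMemErrAlert[DIMM-x].",
]
print(count_cecc_alerts(sample))
```

If the count for one DIMM label keeps growing in log lines dated after the replacement, that matches the symptom described in this article.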