AFF A700s CECC: Correctable Machine Check Errors being reported against wrong DIMM
Applies to
- AFF A700s
- ONTAP 9
- ONTAP 9.1P17 and earlier
- ONTAP 9.3P11 and earlier
- ONTAP 9.4P6 and earlier
Issue
A CECC error continues to be reported against the same DIMM even after that DIMM has been replaced:
The system health alert show command reports errors similar to the following on the cluster:
Node xxxxxx
Monitor controller
Alert ID CriticalCECCCountMemErrAlert
Alerting Resource DIMM-x
Subsystem Memory
Indication Time Tue Oct 09 12:24:36 2018
Perceived Severity Critical
Probable Cause DIMM_Degraded
Description The DIMM has degraded, leading to memory errors.
The following are corrective actions:
1. Contact technical support to obtain a new DIMM of the same specification
2. If possible, perform a takeover of this node and bring the node down for maintenance
3. Refer to the DIMM replacement guide for your given hardware platform to replace the DIMM
4. Bring the storage system online
Possible Effect:
Memory issues can lead to a catastrophic system panic, which can lead to data downtime on the node.
The EMS log displays a message similar to the following, reporting a CECC error on the specific DIMM:
[?] Tue Oct 09 12:24:36 IST [xxxx: mgwd: callhome.hm.alert.critical:alert]: Call home for Health Monitor process nphm: CriticalCECCCountMemErrAlert[DIMM-x].
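When many EMS messages need to be reviewed, the alert ID and the DIMM label can be pulled out of each line with a regular expression. The following is a minimal Python sketch, not a NetApp tool; the pattern is assumed from the single sample line above, and the names EMS_ALERT_RE and parse_cecc_alert are hypothetical.

```python
import re

# Pattern inferred from the one sample EMS line in this article (an assumption,
# not an official EMS format definition).
EMS_ALERT_RE = re.compile(
    r"callhome\.hm\.alert\.critical:alert\]:\s+"
    r"Call home for Health Monitor process (?P<process>\w+):\s+"
    r"(?P<alert_id>\w+)\[(?P<resource>[^\]]+)\]"
)


def parse_cecc_alert(line):
    """Return a dict with process, alert_id and resource, or None if the line does not match."""
    match = EMS_ALERT_RE.search(line)
    return match.groupdict() if match else None


sample = (
    "[?] Tue Oct 09 12:24:36 IST [xxxx: mgwd: callhome.hm.alert.critical:alert]: "
    "Call home for Health Monitor process nphm: CriticalCECCCountMemErrAlert[DIMM-x]."
)
print(parse_cecc_alert(sample))
```

The resource field returned by the sketch is the DIMM label that the alert names, which is the value to compare before and after a replacement.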
Normally, the recommended action is to replace the DIMM named in the alert.
However, even after that DIMM is replaced, the cluster might continue to report errors against the same DIMM.
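One way to confirm this symptom is to compare alert times against the time the DIMM was swapped. The sketch below assumes the alerts have already been collected as (timestamp, DIMM label) pairs; the function name alerts_after_replacement, the replacement time, and every entry except the first are hypothetical values used only for illustration.

```python
from datetime import datetime


def alerts_after_replacement(alerts, dimm, replaced_at):
    """Return timestamps of alerts for `dimm` that occurred after `replaced_at`."""
    return [ts for ts, resource in alerts if resource == dimm and ts > replaced_at]


alerts = [
    (datetime(2018, 10, 9, 12, 24, 36), "DIMM-x"),  # time taken from the alert shown in this article
    (datetime(2018, 10, 12, 8, 0, 0), "DIMM-x"),    # hypothetical later alert
    (datetime(2018, 10, 12, 9, 0, 0), "DIMM-y"),    # hypothetical alert on another DIMM
]
replaced_at = datetime(2018, 10, 10, 15, 0, 0)      # hypothetical replacement time

recurring = alerts_after_replacement(alerts, "DIMM-x", replaced_at)
print(recurring)
```

A non-empty result means the same DIMM label was reported again after the swap, which matches the behavior described in this article.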