NetApp Element software may misreport DIMM memory errors, resulting in a memoryEccThreshold cluster fault on MemCtlr0
Applies to
- NetApp Element software 12.0 and 12.2
- NetApp SolidFire SF-Series product line
- NetApp H-series storage nodes
Issue
- NetApp Element software may misreport correctable errors on DIMMs as being correctable errors on a node's memory controller
- The default threshold settings for ECC errors on a node's memory controller are overly aggressive, resulting in a persistent, error-severity cluster fault after even a single correctable error
- The following cluster fault is shown in NetApp SolidFire Active IQ and the cluster UI:
- Error Code: memoryEccThreshold
- Details: Correctable ECC memory error count crossed threshold on Memory controller: MemCtlr0
- The node's BMC system event log (SEL), however, reports the error(s) on a DIMM at the same time as the cluster fault(s):
[Information] [Memory Error] [Memory] Correctable ECC (CPU_A0) - Asserted
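
When comparing the cluster fault against the SEL, it can help to pull the location label out of each SEL line. The following is a minimal Python sketch, not a NetApp tool. It assumes that exported SEL lines follow the same layout as the single sample line above (three bracketed fields, an event name, a parenthesized location, and a state), and the helper names parse_sel_line and correctable_ecc_counts are hypothetical. Real SEL exports may use other layouts, in which case the pattern would need adjusting.

```python
import re
from collections import Counter

# Pattern inferred from the one sample SEL line in this article (an assumption):
# [severity] [category] [sensor] event (location) - state
SEL_PATTERN = re.compile(
    r"^\[(?P<severity>[^\]]+)\]\s+"
    r"\[(?P<category>[^\]]+)\]\s+"
    r"\[(?P<sensor>[^\]]+)\]\s+"
    r"(?P<event>.+?)\s+\((?P<location>[^)]+)\)\s+-\s+(?P<state>\w+)$"
)


def parse_sel_line(line):
    """Return the fields of one SEL line as a dict, or None if it does not match."""
    match = SEL_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def correctable_ecc_counts(lines):
    """Count asserted correctable ECC events per reported location."""
    counts = Counter()
    for line in lines:
        entry = parse_sel_line(line)
        if (
            entry
            and entry["event"] == "Correctable ECC"
            and entry["state"] == "Asserted"
        ):
            counts[entry["location"]] += 1
    return counts


sample = "[Information] [Memory Error] [Memory] Correctable ECC (CPU_A0) - Asserted"
entry = parse_sel_line(sample)
counts = correctable_ecc_counts([sample, "unrelated log line"])
```

For the sample line, the parsed location is the label CPU_A0 from the SEL entry rather than MemCtlr0, which is the mismatch this article describes.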