NetApp Knowledge Base

How to troubleshoot correctable memory errors on FAS and AFF systems

Applies to

  • ONTAP versions
    • 9.5 and later releases
    • 9.4P6 and later 9.4 releases
    • 9.3P11 and later 9.3 releases
    • 9.1P18 and later 9.1 releases
  • Any FAS and AFF platform other than the FAS25XX, FAS22XX, FAS/V32XX, and FAS/V62XX.
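The version ranges listed above can be checked programmatically. The following is a minimal sketch, not a NetApp tool: the function name, the regular expression, and the handling of version strings such as `9.10.1P3` are assumptions made for illustration. Release families below 9.5 that do not appear in the list (for example 9.2) are treated as not covered, and the platform exclusions are not checked.

```python
import re

# Minimum patch level per release family, taken from the "Applies to" list.
# Families below 9.5 that are not listed are treated as not covered
# (an assumption of this sketch).
MIN_PATCH = {(9, 1): 18, (9, 3): 11, (9, 4): 6}

# Matches strings such as "9.3P11", "9.5", or "9.10.1P3" (assumed formats).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.\d+)?(?:P(\d+))?")


def uses_dynamic_cecc_algorithm(version: str) -> bool:
    """Return True if the ONTAP version string falls in the listed ranges."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized ONTAP version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    if (major, minor) >= (9, 5):
        return True
    minimum = MIN_PATCH.get((major, minor))
    return minimum is not None and patch >= minimum


if __name__ == "__main__":
    for v in ("9.1P17", "9.3P11", "9.4P6", "9.7"):
        print(v, uses_dynamic_cecc_algorithm(v))
```

Keeping the thresholds in a small lookup table means a change to the supported ranges only touches `MIN_PATCH`.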


For all ONTAP FAS and AFF platforms other than the FAS25XX, FAS22XX, FAS/V32XX, and FAS/V62XX, this article supersedes Knowledgebase article How to troubleshoot correctable memory errors.

  1. Why is NetApp changing correctable memory error monitoring in ONTAP?
  • NetApp storage systems utilize error-correcting code (ECC) memory modules (DIMMs) for both main system memory and NVRAM/NVMEM subsystems. When possible, memory errors are corrected in-flight by the memory subsystem hardware with little to no impact on system performance.
    • Until recently, ONTAP running on AFF/FAS storage systems employed a longstanding policy to alert the system administrator about “excessive” CECC memory errors based on a threshold of 500 errors since the last reboot of the system.
    • After recent extensive analysis of correctable ECC (CECC) memory errors by NetApp and its hardware component vendors, it was determined that CECC memory errors are typically not a good predictor of a system disruption due to uncorrectable ECC (UECC) memory errors – especially with the latest generations of memory controllers and dynamic random-access memory (DRAM).
    • Additionally, the CPU cycles used to monitor, log, and correct large numbers of memory errors have a negligible impact on system performance.
  • As a result, NetApp has changed the monitoring algorithm for CECC memory errors used by ONTAP on many currently-supported AFF/FAS systems to a dynamic monitoring algorithm, with much higher thresholds configured to trigger the “CriticalCECCCountMemErrAlert” controller Health Monitor alert and corresponding "Health Monitor" AutoSupport message.
    • Alerts triggered under the older policy can be considered false positives. They should not be taken as an indication for memory replacement, because replacing memory in response to them results in unnecessary hardware maintenance with no tangible benefit.
  • NVRAM DIMMs do have correctable ECC replacement guidelines that are not ONTAP version-specific. For more information, refer to the replacement guidelines matrix.
  2. With the dynamic monitoring algorithm in place, how do I determine when a DIMM or NVDIMM needs to be replaced due to excessive correctable or uncorrectable memory errors?

Refer to the table below for memory replacement guidelines:

ECC Type Category Replacement Criteria

ECC Type: Correctable (CECC), dynamic algorithm

Category: ONTAP versions:

  • 9.1P18 and later 9.1 releases
  • 9.3P11 and later 9.3 releases
  • 9.4P6 and later 9.4 releases
  • 9.5 and later major releases

Replacement Criteria:

  • Do not replace a memory DIMM or NVDIMM based on high CECC memory error counts.
  • DIMM or NVDIMM replacement is only appropriate if ONTAP explicitly triggers:
    • “CriticalCECCCountMemErrAlert” alert in EMS
    • an AutoSupport “Health Monitor” message, for example:

      HA Group Notification from cluster-01 (Health Monitor process nphm: CriticalCECCCountMemErrAlert[DIMM-14]) ALERT.

  • Memory should only be replaced if this alert is seen.
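
As an illustration of acting only on the explicit alert, the sketch below extracts the DIMM label from lines that look like the sample AutoSupport subject above. The pattern is derived from that single sample, and the function name is hypothetical; other ONTAP releases may format the message differently. The result is only meaningful on ONTAP versions that use the dynamic algorithm.

```python
import re

# Pattern built from the single sample subject line in this article;
# real messages may vary between ONTAP releases.
ALERT_RE = re.compile(r"CriticalCECCCountMemErrAlert\[([^\]]+)\]")


def dimms_flagged_for_replacement(lines):
    """Return DIMM labels named in CriticalCECCCountMemErrAlert messages.

    Labels are returned in first-seen order without duplicates.
    """
    flagged = []
    for line in lines:
        for label in ALERT_RE.findall(line):
            if label not in flagged:
                flagged.append(label)
    return flagged


if __name__ == "__main__":
    sample = (
        "HA Group Notification from cluster-01 (Health Monitor process "
        "nphm: CriticalCECCCountMemErrAlert[DIMM-14]) ALERT."
    )
    print(dimms_flagged_for_replacement([sample]))
```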
ECC Type: Correctable (CECC)

Category: ONTAP versions prior to:

  • 9.1P18
  • 9.3P11
  • 9.4P6
  • 9.5

Replacement Criteria:

  • DIMMs reporting correctable ECC errors should NOT be replaced solely because correctable ECC errors are seen in EMS logs or because the “CriticalCECCCountMemErrAlert” system event and AutoSupport messages are seen.
    • The DIMM is not in a failed state.
    • Prior ONTAP versions using the older algorithm policy can generate false positives.
  • To proactively monitor DIMMs, upgrade to a Recommended Release of ONTAP.

ECC Type: Correctable (CECC)

Category: NVRAM DIMM, all ONTAP versions

  • NVRAM11 (AFF A900)
  • NVRAM10 (FAS9000, AFF A700)
  • NVRAM10P (AFF A700s)
  • NVRAM9 (AFF/FAS80x0)

Replacement Criteria:

  • NVRAM DIMMs are a FRU (except for NVRAM10P). Replace the NVRAM DIMM (or NVRAM10P card) when the CECC count is greater than 2 per week or greater than 5 per month.
  • Use the system node environment sensors show command to view the NV CECC Error counter.
    • The NV CECC error counter name varies by NVRAM type:
      • NVRAM11: NV DIMM1 CECC Count, NV DIMM2 CECC Count
      • NVRAM10: NV DIMM0 CECC Count, NV DIMM1 CECC Count (more info)
      • NVRAM10P: NVRAM CECC Count
      • NVRAM9: NV Correctable ECC count
  • You can also read the NV CECC error counter (last-sensor-value) in the PLATFORM-SENSORS.XML AutoSupport file.


<asup:ROW col_time_us="3423606694499">    
   <name>NV Correctable ECC count</name>  
   <discrete-sensor-state />  
   <discrete-sensor-value />  
   <critical-low-threshold /> 
   <warning-low-threshold />  
   <warning-high-threshold /> 
  <critical-high-threshold />
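
The NVRAM DIMM guideline (more than 2 CECC per week or more than 5 per month) can be evaluated from a series of counter readings. The sketch below is one possible interpretation, not NetApp's method: it assumes the readings are cumulative counter values that do not reset between samples, that a week is 7 days and a month is 30 days, and that the caller has already collected the values from the sensor output or AutoSupport data. All function names are hypothetical.

```python
from datetime import datetime, timedelta


def exceeds_rate(readings, window, limit):
    """Return True if the counter rose by more than `limit` within `window`.

    `readings` is a list of (timestamp, cumulative_count) pairs sorted by
    timestamp. The counter is assumed to be non-decreasing.
    """
    start = 0
    for end in range(len(readings)):
        # Advance the window start until it lies within `window` of `end`.
        while readings[end][0] - readings[start][0] > window:
            start += 1
        if readings[end][1] - readings[start][1] > limit:
            return True
    return False


def nvram_dimm_needs_replacement(readings):
    """Apply the >2 per week or >5 per month guideline to the readings."""
    return (exceeds_rate(readings, timedelta(days=7), 2)
            or exceeds_rate(readings, timedelta(days=30), 5))


if __name__ == "__main__":
    t0 = datetime(2023, 1, 1)
    readings = [(t0, 0), (t0 + timedelta(days=3), 3)]
    print(nvram_dimm_needs_replacement(readings))
```

Because the counter is assumed to be non-decreasing, comparing each reading with the earliest reading still inside the window is enough to find the largest increase in that window.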
ECC Type: Uncorrectable (UECC) Panic, Uncorrectable Machine Check Error (UMCE) Panic

Category: All platforms and ONTAP versions

Replacement Criteria:

If PPR results are available, check them before considering part replacement.

Check the system console log from the SP or BMC. Examine the console log for panic message details and Post Package Repair (PPR) operation results.

If PPR information is not available, replace the DIMM associated with the panic.

If PPR results are available:

  • Replacement not required
    If PPR can detect the problematic memory segment, it will repair it.
    • If the system can recover, it logs a message about the event: PPR:Sequence PASS.
    • No further action is needed.
  • Replacement required
    If the memory fails or cannot be repaired, the system will not boot ONTAP and a DIMM replacement is required.
    • If the same DIMM experiences a second UECC error and panic, contact NetApp to order a DIMM replacement.
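
The UECC guidance above amounts to a short decision flow, sketched here for illustration. The function, its parameters, and the order in which the conditions are evaluated are assumptions of this sketch; the inputs are expected to come from a manual review of the SP or BMC console log.

```python
from enum import Enum


class Action(Enum):
    NO_ACTION = "no further action needed"
    REPLACE_DIMM = "replace the DIMM associated with the panic"


def uecc_action(ppr_available, ppr_passed=False,
                repeat_uecc_on_same_dimm=False):
    """Map the UECC replacement criteria to a recommended action.

    ppr_available: PPR results were found in the console log.
    ppr_passed: the log shows a successful repair (PPR:Sequence PASS).
    repeat_uecc_on_same_dimm: the same DIMM has had a second UECC panic.
    """
    if not ppr_available:
        # No PPR information: replace the DIMM named in the panic.
        return Action.REPLACE_DIMM
    if not ppr_passed:
        # The memory failed or could not be repaired.
        return Action.REPLACE_DIMM
    if repeat_uecc_on_same_dimm:
        # A second UECC error and panic on the same DIMM.
        return Action.REPLACE_DIMM
    return Action.NO_ACTION


if __name__ == "__main__":
    print(uecc_action(ppr_available=True, ppr_passed=True).value)
```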

See: BIOS updates for memory reliability and the PPR feature

Check Active IQ to see whether CECC memory errors impact your systems.


  • On versions of ONTAP that use the dynamic algorithm, CECC memory errors continue to be periodically logged in ONTAP event logs. However, they are no longer relevant in determining the need for DIMM replacement.
  • Correctable ECC errors are not an indicator that an uncorrectable ECC error will occur. Should an uncorrectable memory error occur, it will cause a system disruption (panic). If a system disruption occurs, the panic message will call out the DIMM or DIMMs where the uncorrectable error occurred. Those DIMMs might need to be replaced (see the table above).
  • Recent BIOS/LOADER releases for currently shipping ONTAP platforms contain memory management enhancements. These updates improve resiliency to uncorrectable ECC errors and reduce scenarios in which DIMMs can be mapped out during boot, such as those in Bugs 1195242, 1195243, or 1195423. If your BIOS version is not the latest available for your AFF or FAS system, NetApp recommends updating the BIOS to the latest version. Find the latest BIOS/LOADER version for your systems on the System Firmware & Diagnostics Download page.
  • JEDEC-standard NVDIMM modules are used in the following platforms:
    • AFF A800, AFF A400, AFF A320
    • FAS8700, FAS8300