Elevated Write Latency in MetroCluster Triggered by NVDIMM Failure
Applies to
- ONTAP 9
- MetroCluster
Issue
- A sudden spike in write latency was observed on the cluster while an NVDIMM (Non-Volatile DIMM) was failing. The issue coincided with the following sequence of events:
[node-01:cf_main:cf.fsm.takeover.panic:alert]: Failover monitor: takeover attempted after partner panic.
[node-01:cf_takeover:cf.fm.takeoverComplete:notice]: Failover monitor: takeover completed
[node-01:cf_main:cf.fsm.autoGivebackStarted:info]: Failover monitor: Automatic giveback started
[node-01:cf_giveback:cf.fm.givebackComplete:notice]: Failover monitor: giveback completed
[node-02:nphmd:hm.alert.cleared:notice]: AlertId=CriticalCECCCountMemErrAlert, AlertingResource=NVDIMM-11 cleared by monitor controller
- Node-02 experienced a system panic due to degraded NVRAM, triggering automatic takeover by partner node (Node-01).
- After the takeover, ONTAP performed an automatic giveback, returning aggregates to the affected node.
- Post-giveback, Node-02 continued to operate with degraded NVRAM, resulting in elevated write latency across the MetroCluster.
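When reviewing a saved EMS log for this pattern, it can help to check mechanically whether the takeover and giveback events occurred in the order described. The following is a minimal sketch, not a NetApp tool. It assumes each event sits on its own line and carries the bracketed `[node:process:event:severity]` header seen in the excerpt. The names `parse_ems_line` and `takeover_giveback_seen` are hypothetical helpers introduced for illustration only.

```python
import re

# Assumed header layout, taken from the excerpt in this article:
#   [node:process:event.name:severity]: message
# search() is used so that any prefix before the bracket is tolerated.
EMS_HEADER = re.compile(
    r"\[(?P<node>[^:\]]+):(?P<process>[^:\]]+):"
    r"(?P<event>[^:\]]+):(?P<severity>[^:\]]+)\]:\s*(?P<message>.*)"
)

# Failover events in the order reported in this article.
FAILOVER_SEQUENCE = [
    "cf.fsm.takeover.panic",
    "cf.fm.takeoverComplete",
    "cf.fsm.autoGivebackStarted",
    "cf.fm.givebackComplete",
]


def parse_ems_line(line):
    """Return the header fields of one EMS line as a dict, or None."""
    match = EMS_HEADER.search(line.strip())
    return match.groupdict() if match else None


def takeover_giveback_seen(lines):
    """Return True if all four failover events appear in order."""
    position = 0
    for line in lines:
        parsed = parse_ems_line(line)
        if parsed is None or position >= len(FAILOVER_SEQUENCE):
            continue
        if parsed["event"] == FAILOVER_SEQUENCE[position]:
            position += 1
    return position == len(FAILOVER_SEQUENCE)


# Sample input copied from the excerpt in this article.
SAMPLE_LINES = [
    "[node-01:cf_main:cf.fsm.takeover.panic:alert]: Failover monitor: takeover attempted after partner panic.",
    "[node-01:cf_takeover:cf.fm.takeoverComplete:notice]: Failover monitor: takeover completed",
    "[node-01:cf_main:cf.fsm.autoGivebackStarted:info]: Failover monitor: Automatic giveback started",
    "[node-01:cf_giveback:cf.fm.givebackComplete:notice]: Failover monitor: giveback completed",
    "[node-02:nphmd:hm.alert.cleared:notice]: AlertId=CriticalCECCCountMemErrAlert, AlertingResource=NVDIMM-11 cleared by monitor controller",
]

if __name__ == "__main__":
    print(takeover_giveback_seen(SAMPLE_LINES))
```

A match only shows that the events were logged in this order. It does not by itself confirm that the NVDIMM is the cause of the latency.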
