Poor performance and high CPU usage in a single node due to a degraded DIMM
Applies to
- ONTAP 9
- AFF A400
Issue
- High CPU utilization causes poor performance on one node.
- High write latency on a data aggregate, reported as long consistency points (CPs). Example:
Time                Node             Severity      Event
------------------- ---------------- ------------- ---------------------------
7/24/2023 18:33:25  node_name        ERROR         wafl.cp.toolong: Aggregate aggr_name experienced a long CP.
7/24/2023 18:15:22  node_name        ERROR         wafl.cp.toolong: Aggregate aggr_name experienced a long CP.
- The node reboots after a panic and generates a core dump file. Example:
"process on cpu17 hung (telnet_0) for 5001 milliseconds! in SK process telnet_0 on release 9.10.1P12 (C"
- Correctable ECC errors on a DIMM. Example:
Number of correctable ECC since boot 60362216: Information about Correctable ECC: ECC error at DIMM-xx: CE-03-2106-18AEE039,ADDR 0x5959b3100,(Node(1), Memory controller(0), CH(0), DIMM(0), Rank(0), Bank Group(2), Bank(0x0), Row(0x52ad), Col(0x2c0))
Correctable Machine Check Error at CPU17 McBank7. SKL_IMC0 Error: STATUS<0xcc10000001010090> (...)
Number of correctable ECC since boot 60427752: Information about Correctable ECC: ECC error at DIMM-xx: CE-03-2106-18AEE039,ADDR 0x8698e9d00,(Node(1), Memory controller(0), CH(0), DIMM(0), Rank(1), Bank Group(0), Bank(0x0), Row(0x7d3f), Col(0x70))
Correctable Machine Check Error at CPU13 McBank7. SKL_IMC0 Error: STATUS<0xcc10000001010090> (...)
- Memory Error Alert triggered for that DIMM. Example:
[node_name: mgwd: callhome.hm.alert.critical:debug]: Call home for Health Monitor process nphm: CriticalCECCCountMemErrAlert[DIMM-xx].
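When many correctable ECC messages are collected from a node, it can help to extract the since-boot counter and the DIMM label from each line and check how fast the counter grows. The following is a minimal Python sketch, not a NetApp tool. The regular expression is an assumption derived only from the two sample lines shown above, and the names `ECC_LINE`, `parse_ecc_lines`, and `count_growth` are hypothetical; other ONTAP releases or platforms may format the message differently.

```python
import re

# Assumption: the message layout matches the sample lines in this article.
ECC_LINE = re.compile(
    r"Number of correctable ECC since boot (?P<count>\d+):"
    r".*?ECC error at (?P<dimm>DIMM-\S+?):"
    r".*?ADDR (?P<addr>0x[0-9a-fA-F]+)"
    r".*?Rank\((?P<rank>\d+)\)"
)

# The two correctable ECC lines quoted in the Issue section.
SAMPLE = """\
Number of correctable ECC since boot 60362216: Information about Correctable ECC: ECC error at DIMM-xx: CE-03-2106-18AEE039,ADDR 0x5959b3100,(Node(1), Memory controller(0), CH(0), DIMM(0), Rank(0), Bank Group(2), Bank(0x0), Row(0x52ad), Col(0x2c0))
Number of correctable ECC since boot 60427752: Information about Correctable ECC: ECC error at DIMM-xx: CE-03-2106-18AEE039,ADDR 0x8698e9d00,(Node(1), Memory controller(0), CH(0), DIMM(0), Rank(1), Bank Group(0), Bank(0x0), Row(0x7d3f), Col(0x70))
"""


def parse_ecc_lines(text):
    """Return one dict per correctable ECC line found in text."""
    records = []
    for line in text.splitlines():
        match = ECC_LINE.search(line)
        if match:
            records.append({
                "count": int(match.group("count")),    # errors since boot
                "dimm": match.group("dimm"),           # label, e.g. DIMM-xx
                "addr": int(match.group("addr"), 16),  # physical address
                "rank": int(match.group("rank")),
            })
    return records


def count_growth(records):
    """Difference between the last and first since-boot counters."""
    if len(records) < 2:
        return 0
    return records[-1]["count"] - records[0]["count"]


if __name__ == "__main__":
    parsed = parse_ecc_lines(SAMPLE)
    for rec in parsed:
        print(rec["dimm"], rec["count"], hex(rec["addr"]), "rank", rec["rank"])
    print("counter growth between first and last line:", count_growth(parsed))
```

For the two sample lines, the counter grows by 65536 between messages and both errors are reported against the same DIMM label, which is consistent with a single degraded module rather than errors scattered across several DIMMs.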