CriticalCECCCountMemErrAlert and BootDimmDisableAlert observed in AFF A1K
Applies to
- AFF A1K
- System DIMM modules
Issue
- ONTAP triggers an alert against one DIMM module, reported in EMS for CriticalCECCCountMemErrAlertMessage as follows:
[CLUSTER-01: mgwd: callhome.hm.alert.critical:alert]: Call home for Health Monitor process nphm: CriticalCECCCountMemErrAlert[DIMM-32].
- Output of the command ::*> memory dimm show -node <node_name> shows a single DIMM as "degraded":
::*> memory dimm show -node CLUSTER-01
  (system controller memory dimm show)
              DIMM    UECC  CECC  Alert  CPU            Slot           Failure
Node          Name    Count Count Method Socket Channel Number Status  Reason
------------- ------- ----- ----- ------ ------ ------- ------ ------- --------
NAS3_APP_A    DIMM-1  0     0     bucket 1      7       0      ok      none
...
...
              DIMM-32 0     151597 bucket 0     3       0      degraded none <<<<<<<
16 entries were displayed.
- Replacing the affected DIMM does not fix the issue:
- The DIMM shows as failed during the boot sequence
- An additional DIMM fails
- Multiple DIMM modules are disabled
DIMM in slot 1 is disabled
DIMM in slot 5 is disabled
DIMM in slot 7 is disabled
DIMM in slot 12 is disabled
DIMM in slot 14 is disabled
DIMM in slot 16 is disabled
DIMM in slot 17 is disabled
DIMM in slot 21 is disabled
DIMM in slot 23 is disabled
DIMM in slot 28 is disabled
DIMM in slot 30 failed <<<<<< New failed
DIMM in slot 32 failed
- During the boot sequence, the following error is observed:
Apr 13 21:59:46 [CLUSTER-01:platform.reducedMemory:ALERT]: System memory (255 GB) is less than expected (1024 GB). Check DIMMs slots 1, 5, 7, 12, 14, 16, 17, 21, 23, 28, 30, 32.
- Swapping the DIMM modules to different slots does not solve the issue:
Initializing System Memory ...
DIMM:32 mapped out. BIOS MRC mapped out DIMM. Major / Minor Error Code: 0x46 / 0x03
Complete channel mapped out.
- The system is able to boot, but a new alert "BootDimmDisableAlert" is triggered for each of the disabled DIMMs
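
When collecting evidence for a case like this, it can help to confirm that the slots reported on the boot console match the slots named in the platform.reducedMemory message. The following is a minimal Python sketch of that cross-check. It is not a NetApp tool: the function names and both regular expressions are hypothetical and are inferred only from the sample lines shown in this article, so other ONTAP or BIOS versions may word these messages differently.

```python
import re

# Patterns inferred from the sample output in this article (assumption:
# other releases may format these messages differently).
SLOT_LINE = re.compile(r"DIMM in slot (\d+) (is disabled|failed)")
REDUCED_MEMORY = re.compile(
    r"System memory \((\d+) GB\) is less than expected \((\d+) GB\)\. "
    r"Check DIMMs slots ([\d, ]+)\."
)

# Sample text copied from the article.
BOOT_LOG = """\
DIMM in slot 1 is disabled
DIMM in slot 5 is disabled
DIMM in slot 7 is disabled
DIMM in slot 12 is disabled
DIMM in slot 14 is disabled
DIMM in slot 16 is disabled
DIMM in slot 17 is disabled
DIMM in slot 21 is disabled
DIMM in slot 23 is disabled
DIMM in slot 28 is disabled
DIMM in slot 30 failed
DIMM in slot 32 failed
"""

EMS_LINE = (
    "Apr 13 21:59:46 [CLUSTER-01:platform.reducedMemory:ALERT]: "
    "System memory (255 GB) is less than expected (1024 GB). "
    "Check DIMMs slots 1, 5, 7, 12, 14, 16, 17, 21, 23, 28, 30, 32."
)


def parse_boot_slots(text):
    """Return {slot_number: 'disabled' | 'failed'} from boot console text."""
    return {
        int(slot): "disabled" if state == "is disabled" else "failed"
        for slot, state in SLOT_LINE.findall(text)
    }


def parse_reduced_memory(line):
    """Return (actual_gb, expected_gb, [slots]) or None if the line does not match."""
    match = REDUCED_MEMORY.search(line)
    if match is None:
        return None
    actual, expected, slots = match.groups()
    return int(actual), int(expected), [int(s) for s in slots.split(",")]


if __name__ == "__main__":
    boot_slots = parse_boot_slots(BOOT_LOG)
    actual_gb, expected_gb, ems_slots = parse_reduced_memory(EMS_LINE)
    print("Failed slots:", [s for s, st in boot_slots.items() if st == "failed"])
    print("Memory seen / expected (GB):", actual_gb, "/", expected_gb)
    print("Slot lists agree:", sorted(boot_slots) == ems_slots)
```

The sketch keeps the two sources separate so that a mismatch between the console slot list and the EMS slot list is easy to spot and attach to a support case.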
