H610S nodes offline and in a boot loop due to uncorrectable errors on NVDIMM
Applies to
- NetApp SolidFire H610S with BIOS 3B06
- NetApp Element software 12.3.X and earlier
Issue
- One or more nodes are offline and in a boot loop
- Nodes attempt to boot but fail before loading Element
- Reboot occurs right after the NetApp splash screen
- BMC system event log (SEL) will show the following:
[CATERR] Machine Check Exception (MCERR)
[MCERR] Uncorrectable Error - Machine Check Error
[Memory Error] Uncorrectable ECC(CPU0_<xx>)
- Volume offline or degraded messages are possible
Example: Active IQ error alerts when multiple nodes are affected
The following volumes are offline. [X, X, X, X, X, X]
The SolidFire Application cannot communicate with Storage node having node ID 11.
Cluster Block Data is in a degraded state, and the auto-heal process to restore full block data redundancy cannot proceed. Either too many nodes or block services are offline, or the cluster block services are too full.
Example: SEL from BMC web GUI
1160 Sep/8/2022 20:16:41 [Information] [Power Unit] [Power Unit] Power Off / Power Down - Deasserted
1159 Sep/8/2022 20:16:36 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
1158 Sep/8/2022 20:16:36 [Information] [Power Unit] [Power Unit] Power Off / Power Down - Asserted
1157 Sep/8/2022 20:16:35 [Warning] [Additional MCE Error] [OEM Record C2] ManufacturerID:001C4C, Extra Information : 0 MSCOD:0010 MCACOD:0134
1156 Sep/8/2022 20:16:35 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
1155 Sep/8/2022 20:16:35 [Critical] [MCERR] [Processor] Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 2 - Asserted
1154 Sep/8/2022 20:16:35 [Critical] [Memory Error] [Memory] Uncorrectable ECC(CPU0_F1) - Asserted
Note: NVDIMMs occupy specific slots on the H610S models: CPU0_C0 and CPU0_F0 on H610S1/S2, and CPU0_C1 and CPU0_F1 on H610S4.
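The slot check described in the note can be sketched in Python. This is a minimal illustration, not a NetApp tool: the function name, the model keys (H610S1, H610S2, H610S4, spelled as in the note), and the regular expression are assumptions based only on the SEL lines shown above. It pulls the slot name out of each "Uncorrectable ECC(...)" entry and reports the ones that are NVDIMM slots for the given model.

```python
import re

# NVDIMM slots per model, copied from the note above.
# The model keys are an assumed spelling of "H610S1/S2" and "H610S4".
NVDIMM_SLOTS = {
    "H610S1": {"CPU0_C0", "CPU0_F0"},
    "H610S2": {"CPU0_C0", "CPU0_F0"},
    "H610S4": {"CPU0_C1", "CPU0_F1"},
}

# Matches entries such as "Uncorrectable ECC(CPU0_F1)" and captures the slot.
ECC_PATTERN = re.compile(r"Uncorrectable ECC\((CPU\d+_[A-Z]\d+)\)")


def nvdimm_ecc_slots(sel_text, model):
    """Return the sorted NVDIMM slots that report an uncorrectable ECC error."""
    nvdimm = NVDIMM_SLOTS[model]
    reported = ECC_PATTERN.findall(sel_text)
    return sorted({slot for slot in reported if slot in nvdimm})


# Two lines taken from the BMC web GUI example above.
SAMPLE_SEL = """\
1155 Sep/8/2022 20:16:35 [Critical] [MCERR] [Processor] Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 2 - Asserted
1154 Sep/8/2022 20:16:35 [Critical] [Memory Error] [Memory] Uncorrectable ECC(CPU0_F1) - Asserted
"""

if __name__ == "__main__":
    print(nvdimm_ecc_slots(SAMPLE_SEL, "H610S4"))
```

An empty result means the uncorrectable ECC errors in the log, if any, are not on an NVDIMM slot for that model.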
Example: SEL from ipmitool output
SEL Record ID : 0482
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0001
EvM Revision : 04
Sensor Type : Memory
Sensor Number : 87
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : a1ff29
Description : Uncorrectable ECC

SEL Record ID : 0483
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0001
EvM Revision : 04
Sensor Type : Processor
Sensor Number : a8
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : ab0102
Description : Uncorrectable machine check exception

SEL Record ID : 0484
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0020
EvM Revision : 04
Sensor Type : Processor
Sensor Number : 74
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 0bffff
Description : Uncorrectable machine check exception

SEL Record ID : 0485
Record Type : c2 (OEM timestamped)
Timestamp : 09/08/2022 20:16:35
Manufactacturer ID : 001c4c
OEM Defined : 000010003401 [......]

SEL Record ID : 0486
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Generator ID : 0020
EvM Revision : 04
Sensor Type : Power Unit
Sensor Number : 77
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 00ffff
Description : Power off/down

SEL Record ID : 0487
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Generator ID : 0020
EvM Revision : 04
Sensor Type : Processor
Sensor Number : 74
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 0bffff
Description : Uncorrectable machine check exception

SEL Record ID : 0488
Record Type : 02
Timestamp : 09/08/2022 20:16:41
Generator ID : 0020
EvM Revision : 04
Sensor Type : Power Unit
Sensor Number : 77
Event Type : Sensor-specific Discrete
Event Direction : Deassertion Event
Event Data : 00ffff
Description : Power off/down
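
When the SEL is long, the uncorrectable events can be filtered out with a short script. The following Python sketch assumes the ipmitool output has been saved as text with one "Key : Value" field per line, as ipmitool's verbose SEL output normally prints it. The function names are illustrative, and the two descriptions it matches are the ones that appear in the example above.

```python
def parse_sel_records(text):
    """Split verbose SEL text into a list of {field: value} dictionaries."""
    records = []
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a field
        key, value = key.strip(), value.strip()
        if key == "SEL Record ID":
            records.append({})  # a new record starts here
        if records:
            records[-1][key] = value
    return records


def uncorrectable_events(records):
    """Keep only records whose description matches the signatures above."""
    wanted = {"Uncorrectable ECC", "Uncorrectable machine check exception"}
    return [r for r in records if r.get("Description") in wanted]


# Abbreviated sample built from records 0482, 0483 and 0486 in the example.
SAMPLE = """
SEL Record ID : 0482
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Sensor Type : Memory
Event Direction : Assertion Event
Description : Uncorrectable ECC

SEL Record ID : 0483
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Sensor Type : Processor
Event Direction : Assertion Event
Description : Uncorrectable machine check exception

SEL Record ID : 0486
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Sensor Type : Power Unit
Event Direction : Assertion Event
Description : Power off/down
"""

if __name__ == "__main__":
    for rec in uncorrectable_events(parse_sel_records(SAMPLE)):
        print(rec["SEL Record ID"], rec["Timestamp"], rec["Description"])
```

The ipmitool records in this example carry the slot only inside the raw event data, so this sketch reports the event type and time but does not identify the DIMM slot.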