NetApp Knowledge Base

H610S nodes offline and in a boot loop due to uncorrectable errors on NVDIMM

Category:
element-software
Specialty:
solidfire

Applies to

  • NetApp SolidFire H610S with BIOS 3B06
  • NetApp Element software 12.3.x and earlier

Issue

  • One or more nodes are offline and stuck in a boot loop
    • Nodes attempt to boot but fail before loading Element
    • The reboot occurs right after the NetApp splash screen
  • The BMC system event log (SEL) shows the following:
    • [CATERR] Machine Check Exception (MCERR) 
    • [MCERR] Uncorrectable Error - Machine Check Error
    • [Memory Error] Uncorrectable ECC(CPU0_<xx>)
  • Volume offline or degraded messages are possible 

Example: Active IQ error alerts when multiple nodes are affected 

The following volumes are offline. [X, X, X, X, X, X]

The SolidFire Application cannot communicate with Storage node having node ID 11.

Cluster Block Data is in a degraded state, and the auto-heal process to restore full block data redundancy cannot proceed. Either too many nodes or block services are offline, or the cluster block services are too full.

Example: SEL from the BMC web GUI

  1160  Sep/8/2022  20:16:41  [Information]  [Power Unit]            [Power Unit]     Power Off / Power Down - Deasserted
  1159  Sep/8/2022  20:16:36  [Critical]     [CATERR]                [Processor]      Machine Check Exception (MCERR) - Asserted
  1158  Sep/8/2022  20:16:36  [Information]  [Power Unit]            [Power Unit]     Power Off / Power Down - Asserted
  1157  Sep/8/2022  20:16:35  [Warning]      [Additional MCE Error]  [OEM Record C2]  ManufacturerID:001C4C, Extra Information : 0 MSCOD:0010 MCACOD:0134
  1156  Sep/8/2022  20:16:35  [Critical]     [CATERR]                [Processor]      Machine Check Exception (MCERR) - Asserted
  1155  Sep/8/2022  20:16:35  [Critical]     [MCERR]                 [Processor]      Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 2 - Asserted
  1154  Sep/8/2022  20:16:35  [Critical]     [Memory Error]          [Memory]         Uncorrectable ECC(CPU0_F1) - Asserted

Note: NVDIMMs occupy specific slots on the H610S models. H610S1/S2 use CPU0_C0 and CPU0_F0; H610S4 uses CPU0_C1 and CPU0_F1. An Uncorrectable ECC entry that names one of these slots therefore points at an NVDIMM.
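The slot check described in the note can be sketched in a few lines of Python. This is only an illustration, not a NetApp tool: the function names `ecc_slot` and `is_nvdimm_slot` and the model keys are hypothetical, and the regular expression assumes the slot appears in the `Uncorrectable ECC(CPU0_xx)` form shown in the SEL example above.

```python
import re

# NVDIMM slots per model, copied from the note above.
# The dictionary keys are an assumed spelling of the model names.
NVDIMM_SLOTS = {
    "H610S1": {"CPU0_C0", "CPU0_F0"},
    "H610S2": {"CPU0_C0", "CPU0_F0"},
    "H610S4": {"CPU0_C1", "CPU0_F1"},
}

# Assumes the SEL text carries the slot as "Uncorrectable ECC(CPU0_F1)".
SLOT_RE = re.compile(r"Uncorrectable ECC\((CPU\d+_[A-Z]\d+)\)")


def ecc_slot(sel_text):
    """Return the DIMM slot named in an Uncorrectable ECC entry, or None."""
    match = SLOT_RE.search(sel_text)
    return match.group(1) if match else None


def is_nvdimm_slot(model, sel_text):
    """True if the ECC error in sel_text names an NVDIMM slot for this model."""
    slot = ecc_slot(sel_text)
    return slot is not None and slot in NVDIMM_SLOTS.get(model, set())


if __name__ == "__main__":
    line = "[Critical] [Memory Error] [Memory] Uncorrectable ECC(CPU0_F1) - Asserted"
    print(ecc_slot(line), is_nvdimm_slot("H610S4", line))
```

A lookup table keeps the model-to-slot mapping in one place, so an unknown model simply returns False rather than raising an error.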

Example: SEL from ipmitool output

SEL Record ID          : 0482
 Record Type           : 02
 Timestamp             : 09/08/2022 20:16:35
 Generator ID          : 0001
 EvM Revision          : 04
 Sensor Type           : Memory
 Sensor Number         : 87
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : a1ff29
 Description           : Uncorrectable ECC

SEL Record ID          : 0483
 Record Type           : 02
 Timestamp             : 09/08/2022 20:16:35
 Generator ID          : 0001
 EvM Revision          : 04
 Sensor Type           : Processor
 Sensor Number         : a8
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : ab0102
 Description           : Uncorrectable machine check exception

SEL Record ID          : 0484
 Record Type           : 02
 Timestamp             : 09/08/2022 20:16:35
 Generator ID          : 0020
 EvM Revision          : 04
 Sensor Type           : Processor
 Sensor Number         : 74
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : 0bffff
 Description           : Uncorrectable machine check exception

SEL Record ID          : 0485
 Record Type           : c2  (OEM timestamped)
 Timestamp             : 09/08/2022 20:16:35
 Manufactacturer ID    : 001c4c
 OEM Defined           : 000010003401 [......]

SEL Record ID          : 0486
 Record Type           : 02
 Timestamp             : 09/08/2022 20:16:36
 Generator ID          : 0020
 EvM Revision          : 04
 Sensor Type           : Power Unit
 Sensor Number         : 77
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : 00ffff
 Description           : Power off/down

SEL Record ID          : 0487
 Record Type           : 02
 Timestamp             : 09/08/2022 20:16:36
 Generator ID          : 0020
 EvM Revision          : 04
 Sensor Type           : Processor
 Sensor Number         : 74
 Event Type            : Sensor-specific Discrete
 Event Direction       : Assertion Event
 Event Data            : 0bffff
 Description           : Uncorrectable machine check exception

SEL Record ID          : 0488
 Record Type           : 02
 Timestamp             : 09/08/2022 20:16:41
 Generator ID          : 0020
 EvM Revision          : 04
 Sensor Type           : Power Unit
 Sensor Number         : 77
 Event Type            : Sensor-specific Discrete
 Event Direction       : Deassertion Event
 Event Data            : 00ffff
 Description           : Power off/down
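
When many records have to be reviewed, the ipmitool record format shown above can be parsed mechanically. The following is a minimal sketch, assuming the records are saved as plain text, separated by blank lines, with one `key : value` pair per line as in the example. The function names `parse_sel_records` and `uncorrectable_events` are hypothetical, and the filter matches only the two description strings that appear in this article.

```python
def parse_sel_records(text):
    """Split verbose SEL text into a list of dicts, one per record."""
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def uncorrectable_events(records):
    """Return records matching the memory and processor errors in this article."""
    wanted = {
        ("Memory", "Uncorrectable ECC"),
        ("Processor", "Uncorrectable machine check exception"),
    }
    return [
        r for r in records
        if (r.get("Sensor Type"), r.get("Description")) in wanted
    ]


if __name__ == "__main__":
    sample = (
        "SEL Record ID          : 0482\n"
        " Timestamp             : 09/08/2022 20:16:35\n"
        " Sensor Type           : Memory\n"
        " Description           : Uncorrectable ECC\n"
    )
    for rec in uncorrectable_events(parse_sel_records(sample)):
        print(rec["SEL Record ID"], rec["Timestamp"], rec["Description"])
```

OEM records (Record Type c2) have no Sensor Type or Description field, so this filter skips them.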

 

 


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.