Node services are hung and all drives become available after a Correctable ECC error
Applies to
NetApp H700S
Issue
- The System Event Log (SEL) at the BMC/IPMI shows a Correctable ECC alert at the time of the event:
<Event ID> | <DATE> | <TIME> | Memory | Correctable ECC (@DIMM<ID>(CPU<#>)) | Asserted
- NetApp SolidFire ActiveIQ and the cluster UI show that all block and metadata services of the node are unresponsive shortly after the SEL entry:
unresponsiveService - A block service is not responding. (This alert is shown for each block drive.)
unresponsiveService - A metadata service is not responding.
- Unhealthy service alerts are shown shortly after the unresponsive service alerts:
blockServiceUnhealthy - A block service is unhealthy and SolidFire is attempting to migrate data away from it.
sliceServiceUnhealthy - A metadata service is unhealthy and SolidFire is attempting to migrate data away from it.
- All drives of the affected node become available after a while (NetApp SolidFire ActiveIQ and cluster UI):
driveAvailable - Node ID X has Y available drive(s).
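
When reviewing an exported SEL, the Correctable ECC entries can be picked out with a short script and then compared against the alert timestamps in ActiveIQ. The following is a minimal sketch in Python, assuming the SEL has been saved as text in the pipe-separated layout shown above. The function name, the sample lines, and the DIMM/CPU values are hypothetical and used for illustration only; they are not taken from a real system or from any NetApp tool.

```python
import re

# Pattern mirrors the SEL layout from the Issue section:
# <Event ID> | <DATE> | <TIME> | Memory | Correctable ECC (@DIMM<ID>(CPU<#>)) | Asserted
SEL_ECC = re.compile(
    r"^\s*(?P<event_id>[^|]+?)\s*\|\s*(?P<date>[^|]+?)\s*\|\s*(?P<time>[^|]+?)\s*\|"
    r"\s*Memory\s*\|\s*Correctable ECC\s*\(@DIMM(?P<dimm>[^()]+)\(CPU(?P<cpu>[^()]+)\)\)\s*\|"
    r"\s*Asserted\s*$"
)


def find_correctable_ecc(lines):
    """Return one dict per SEL line that reports an asserted Correctable ECC event."""
    events = []
    for line in lines:
        match = SEL_ECC.match(line)
        if match:
            events.append(match.groupdict())
    return events


# Hypothetical sample lines, for illustration only.
sample = [
    "1a | 01/02/2020 | 03:04:05 | Memory | Correctable ECC (@DIMMA1(CPU1)) | Asserted",
    "1b | 01/02/2020 | 03:05:00 | Power Supply | Presence detected | Asserted",
]

ecc_events = find_correctable_ecc(sample)
for event in ecc_events:
    print(event["date"], event["time"], "DIMM", event["dimm"], "CPU", event["cpu"])
```

The date and time of each match can then be checked against the first unresponsiveService alert to confirm that the services stopped responding shortly after the ECC event.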