Element OS upgrade causes drive ejection and ghost drives due to CRC errors
Applies to
- NetApp HCI - H410S
- Element OS 12.9.x
Issue
- During an upgrade to Element OS 12.9.0.151 on an H410S node, one or more drives intermittently eject or transition between states (for example, between Available and Failed).
Observed behaviors include:
- Drives in specific slots repeatedly drop offline, reappear, or fail during or after reboot and re-add attempts
- Ghost drive objects appear after link resets, which confuses the slot-to-drive mapping
- smartctl reports the drives as healthy and SMART self-tests pass, yet the platform still ejects them
- Counts of CRC failures with the codes below increase rapidly in kern.log on only the affected devices, while other drives remain at zero:
| Error Code | Count | Meaning |
|---|---|---|
| 0x31080000 | 186,046 | PL code 0x08: SATA NCQ protocol error (link CRC failures) |
| 0x31110610 | 7,367 | PL code 0x11, sub 0x0610: Open Failure – Rate Not Supported (PHY speed negotiation failure) |
| 0x31110d01 | 1,071 | PL code 0x11, sub 0x0d01: Open Failure – Zone Violation (SAS topology/routing error) |
| 0x31120434 | 181 | PHY reset/retry |
| 0x31120b10 | 68 | PHY link negotiation |
| 0x31120d02 | 51 | Zone/routing error |
| 0x31111000 | 35 | PHY reset/link change |
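The error codes listed above can be split into their fields and tallied from log text with a short script. The following is a minimal sketch, not a NetApp tool. It assumes the codes use the 32-bit layout of the Linux mpt3sas driver's log info word (bus type in the top 4 bits, originator in the next 4, then an 8-bit code and a 16-bit sub-code), which agrees with the "PL code / sub" breakdown in the list above. It also assumes the codes appear in kern.log as `log_info(0x........)`. The function names and the sample log lines are hypothetical.

```python
import re
from collections import Counter

# Assumed names for the 4-bit originator field (mpt3sas-style layout).
ORIGINATORS = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}

# Assumed log format: the code appears as log_info(0xXXXXXXXX) in the line.
LOGINFO_RE = re.compile(r"log_info\((0x[0-9a-fA-F]{8})\)")


def decode_loginfo(value):
    """Split a 32-bit log info value into its assumed bit fields."""
    return {
        "bus_type": (value >> 28) & 0xF,    # top 4 bits
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,       # 8-bit code
        "sub_code": value & 0xFFFF,         # low 16 bits
    }


def count_loginfo(lines):
    """Count how often each log info code occurs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for match in LOGINFO_RE.finditer(line):
            counts[int(match.group(1), 16)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; real kern.log content may differ.
    sample = [
        "kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)",
        "kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)",
        "kernel: mpt3sas_cm0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)",
    ]
    for value, count in count_loginfo(sample).most_common():
        fields = decode_loginfo(value)
        print(f"0x{value:08x} count={count} originator={fields['originator']} "
              f"code=0x{fields['code']:02x} sub=0x{fields['sub_code']:04x}")
```

To run it against a real log, pass an open file object to `count_loginfo`, for example `count_loginfo(open("kern.log"))`. Grouping the counts per device would need an additional field from each log line, which is not covered here because the device identifier format is not shown in this article.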
