SolidFire block drive repeatedly ejected from the cluster (including replacement drives)
Applies to
All Element Software storage nodes
Issue
- A drive has failed, and the replacement drive is repeatedly ejected after being added back to the cluster, then becomes available again.
- The cluster shows an extended time (hundreds of hours or more) for block sync, but the sync finishes in a reasonable time. (An example screenshot can be seen in Additional Information.)
- A blockServiceUnhealthy alert is generated in the Alerts section.
- An "Unhealthy block service added" event is shown in the Events section as soon as the drive is added to the cluster.
- In some cases you also get lowDriveLife alerts.
- The following errors are seen in kern.log:
2024-11-17T23:04:28.102407Z hci-stg-03 kernel: [1458248.977688] print_req_error: I/O error, dev sde, sector 480
2024-11-17T23:04:28.102409Z hci-stg-03 kernel: [1458248.977690] Buffer I/O error on dev sde, logical block 60, async page read
2024-11-17T23:05:11.847278Z hci-stg-03 kernel: [1458292.722559] sd 10:0:6:0: [sde] Unaligned partial completion (resid=1020, sector_sz=512)
2024-11-17T23:05:11.847286Z hci-stg-03 kernel: [1458292.722567] sd 10:0:6:0: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
2024-11-17T23:05:11.847289Z hci-stg-03 kernel: [1458292.722570] sd 10:0:6:0: [sde] tag#0 Sense Key : Aborted Command [current]
2024-11-17T23:05:11.847292Z hci-stg-03 kernel: [1458292.722573] sd 10:0:6:0: [sde] tag#0 Add. Sense: Information unit iuCRC error detected
2024-11-17T23:05:11.847295Z hci-stg-03 kernel: [1458292.722576] sd 10:0:6:0: [sde] tag#0 CDB: Read(10) 28 00 00 00 00 08 00 00 08 00
2024-11-17T23:05:11.847297Z hci-stg-03 kernel: [1458292.722578] print_req_error: I/O error, dev sde, sector 8
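To check whether a collected kern.log contains the same error signatures, and which device they point to, a small script can tally the matching lines per device. The following is a minimal sketch, not a NetApp tool: the function name count_io_errors and the regular expressions are assumptions derived only from the sample lines above, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Assumed patterns, based only on the sample kern.log lines in this article.
# Each pattern captures the block device name (for example "sde") in group 1.
ERROR_PATTERNS = [
    re.compile(r"I/O error, dev (\w+)"),
    re.compile(r"Buffer I/O error on dev (\w+)"),
    re.compile(
        r"\[(sd\w+)\] .*(?:FAILED Result|iuCRC error|Unaligned partial completion)"
    ),
]


def count_io_errors(lines):
    """Return a Counter mapping device name to the number of matching error lines."""
    counts = Counter()
    for line in lines:
        for pattern in ERROR_PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # count each log line at most once
    return counts
```

The function accepts any iterable of lines, so it can be called on an open file object for a kern.log copied from the node's log bundle. If the errors follow the replacement drive and keep naming the same device, that is worth noting when working through the rest of this article.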