High failure rate on ST16000NM002G drives under very specific workload conditions
Applies to
- E5760
- GPFS
- SANtricity OS 11.60.2R1 - 11.70.2
- Seagate ST16000NM002G drive firmware NE00 and/or NE01
Issue
The issue has so far only been observed on an IBM GPFS filesystem spanning multiple E5760 E-Series storage arrays under specific workload conditions.
In this particular instance, the drive vendor's analysis showed that 99.99% of writes land in 0.01% of the drive, within a range of 1.6GB, with writes of up to 106MB/s to some hot spots in the lower LBA range.
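The figures above can be sanity-checked with a little arithmetic. The sketch below assumes a nominal 16 TB (decimal) capacity, inferred from the ST16000NM002G model number and not stated in the vendor analysis. The overwrite time is purely illustrative: it assumes the 106MB/s peak were sustained over the whole hot range, which the analysis does not claim.

```python
# Assumption: nominal capacity of 16 TB (decimal), inferred from the model number.
DRIVE_CAPACITY_BYTES = 16 * 10**12

# ".01 %" of the drive is one part in 10,000.
hot_region_bytes = DRIVE_CAPACITY_BYTES // 10_000

# Peak write rate reported for some hot spots: 106 MB/s.
PEAK_WRITE_RATE_BYTES_PER_S = 106 * 10**6

# Illustrative only: time to rewrite the whole hot range once at the peak rate.
seconds_per_pass = hot_region_bytes / PEAK_WRITE_RATE_BYTES_PER_S

print(f"hot range: {hot_region_bytes / 10**9:.1f} GB")      # hot range: 1.6 GB
print(f"one full pass at peak rate: {seconds_per_pass:.1f} s")  # about 15.1 s
```

Under these assumptions, 0.01% of the drive matches the 1.6GB range quoted by the vendor, which shows how small the heavily written region is relative to the drive.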
Symptoms may include:
- Degraded drive channels as a result of drive side timeouts
- Write timeouts on multiple drives (IOP_FAST_TIMEOUT_ERROR)
- PI errors
- Unreadable sectors reported (URS/data loss)
Regular troubleshooting steps, as detailed in the E-Series degraded drive channel and multiple individual drive degraded path KB, do not bring a resolution.
The issue occurs in different shelves, drawers, and drive bays, and there is no identifiable common component in the chain that is failing.
Reseating all drives and snake cables (or applying other troubleshooting steps from the above KB) does not bring any improvement.
Drives are less than a year old (well below the 5-year age limit), and replacement drives in the same slot show the same symptoms or fail as well.
Major event logs will show events similar to the following:
A:11/30/21, 3:31:03 AM (03:31:03) 2206 1209 Drive channel set to Degraded - Drive-side: channel 3 <--CRITICAL
A:11/30/21, 3:31:03 AM (03:31:03) 2205 1513 Individual drive - Degraded path - Drive-side: channel 3 <--CRITICAL
A:11/30/21, 3:30:55 AM (03:30:55) 2204 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:46 AM (03:30:46) 2203 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x189fae400, Blocks: 0x400 - Recovered
----> Flags: 0x40202001 = READ: Read Operation, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:30:43 AM (03:30:43) 2202 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0xb49c7358, Blocks: 0x8 - Recovered
----> Flags: 0x40202001 = READ: Read Operation, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:30:43 AM (03:30:43) 2201 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:06 AM (03:30:06) 2200 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:49 AM (03:29:49) 2199 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:41 AM (03:29:41) 2198 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x21639800, Blocks: 0x400 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:29:39 AM (03:29:39) 2197 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:38 AM (03:29:38) 2196 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x1538dcec0, Blocks: 0x10 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:29:35 AM (03:29:35) 2195 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x1266587a0, Blocks: 0x8 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:12/31/21, 9:31:45 AM (09:31:45) 52721 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239814b <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x84047314b
A:12/31/21, 9:31:44 AM (09:31:44) 52720 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239814a <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x84047314a
A:12/31/21, 9:31:42 AM (09:31:42) 52719 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c2398149 <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x840473149
A:12/31/21, 9:31:41 AM (09:31:41) 52718 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c2398148 <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x840473148
A:12/31/21, 9:31:41 AM (09:31:41) 52717 201e VDD repair started - Shelf 30, Bay A - SSID: 33, Devnum: 0xffffff
A:12/31/21, 9:31:41 AM (09:31:41) 52716 201f VDD repair completed - Shelf 30, Bay A - SSID: 33, Devnum: 0x010217 LBA: 0x12c2399800
----> Flags: 0x202005 = READ: Read Operation, ERROR: IO Compl. w. Err, NOLOCK: Prevent lock during read err., PI: Error coding in effect - Error: 0x844 = UA_MISCORRECTED_DATA_ERROR
A:12/31/21, 9:31:40 AM (09:31:40) 52715 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239994f <--CRITICAL
----> Physical Drive in Tray 32 Slot 24, LBA: 0x3eed7314f
A:12/31/21, 9:31:40 AM (09:31:40) 52714 1012 Destination driver error - Shelf 32, Drawer 2, Bay 11
A:12/31/21, 9:31:40 AM (09:31:40) 52713 1016 Drive returned unrecoverable media error - Shelf 32, Drawer 2, Bay 11
----> Sense 3/11/0 = Medium Error - Unrecovered read error - CDB: 0x7f(0x9) = Read(32) - LBA: ~0x3eed7314f
A:12/31/21, 9:31:37 AM (09:31:37) 52712 1016 Drive returned unrecoverable media error - Shelf 32, Drawer 2, Bay 11
----> Sense 3/11/0 = Medium Error - Unrecovered read error - CDB: 0x7f(0x9) = Read(32) - LBA: ~0x3eed7314f
A:12/25/21, 7:58:16 AM (07:58:16) 47154 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47153 2215 Drive marked failed - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47152 226c Drive failure - Shelf 33, Drawer 4, Bay 5 - Cause: 3 = Write failure; Drive WWN: 5000c500cadc69b7; SN: ZL29F9KB0000C107BKS5 <--CRITICAL
B:12/25/21, 7:58:40 AM (07:58:40) 47151 2226 Drive spun down - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47150 7e05 Drive recovery criteria not met - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:39 AM (07:58:39) 47149 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5
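When checking whether timeouts cluster on a common component, it can help to tally drive-side timeout events (event code 100d in the excerpt above) per shelf, drawer, and bay. The following is a minimal sketch: the entry pattern is inferred from the excerpt in this article, not from an official log format specification, and the helper names `parse_events` and `timeouts_per_drive` are hypothetical. It locates entries by their header, so it works whether or not each event sits on its own line.

```python
import re
from collections import Counter

# Entry header as seen in the excerpt (an inferred pattern, not an official spec):
# controller, date, 12-hour time, 24-hour time, sequence number, event code.
EVENT_START = re.compile(
    r"(?P<ctrl>[AB]):(?P<date>\d{1,2}/\d{1,2}/\d{2}), "
    r"\d{1,2}:\d{2}:\d{2} [AP]M \((?P<time>\d{2}:\d{2}:\d{2})\) "
    r"(?P<seq>\d+) (?P<code>[0-9a-f]{4}) "
)
LOCATION = re.compile(r"Shelf (\d+), Drawer (\d+), Bay (\d+)")


def parse_events(text):
    """Split log text into events; each body runs up to the next entry header."""
    starts = list(EVENT_START.finditer(text))
    events = []
    for i, match in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        events.append({**match.groupdict(), "body": text[match.end():end].strip()})
    return events


def timeouts_per_drive(events):
    """Count 100d (drive-side timeout) events per (shelf, drawer, bay)."""
    counts = Counter()
    for event in events:
        if event["code"] != "100d":
            continue
        location = LOCATION.search(event["body"])
        if location:
            counts[tuple(int(g) for g in location.groups())] += 1
    return counts


# Four entries copied from the excerpt above.
SAMPLE = """\
A:11/30/21, 3:30:55 AM (03:30:55) 2204 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:43 AM (03:30:43) 2201 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47152 226c Drive failure - Shelf 33, Drawer 4, Bay 5 - Cause: 3 = Write failure; Drive WWN: 5000c500cadc69b7; SN: ZL29F9KB0000C107BKS5 <--CRITICAL
B:12/25/21, 7:58:39 AM (07:58:39) 47149 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5
"""

for (shelf, drawer, bay), count in timeouts_per_drive(parse_events(SAMPLE)).most_common():
    print(f"Shelf {shelf}, Drawer {drawer}, Bay {bay}: {count} timeout(s)")
```

Timeouts spread across unrelated shelves and drawers, as in this case, point away from a single failing cable, drawer, or shelf component.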