NetApp Knowledge Base

High failure rate on ST16000NM002G drives under very specific workload conditions

Applies to

  • E5760
  • GPFS
  • SANtricity OS 11.60.2R1 - 11.70.2
  • Seagate ST16000NM002G drive firmware NE00 and/or NE01

Issue

So far, the issue has only been observed on an IBM GPFS filesystem spanning multiple E5760 E-Series storage arrays under specific workload conditions.
In this particular instance, the drive vendor's analysis showed that 99.99% of writes land in 0.01% of the drive, within a range of 1.6 GB.
Some hot spots in the lower LBA range receive writes at up to 106 MB/s.
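
As a rough consistency check on these figures, the sketch below works out how large 0.01% of the drive is. It assumes the nominal decimal capacity of 16 TB for the ST16000NM002G, which this article does not state, and `hot_region_bytes` is a name introduced here for illustration only.

```python
# Assumption: nominal decimal capacity of the ST16000NM002G (16 TB).
DRIVE_CAPACITY_BYTES = 16 * 10**12
# Share of the drive that receives 99.99% of the writes, in percent.
HOT_FRACTION_PERCENT = 0.01


def hot_region_bytes(capacity_bytes: int, percent: float) -> float:
    """Return the size in bytes of `percent` percent of the drive."""
    return capacity_bytes * percent / 100


size = hot_region_bytes(DRIVE_CAPACITY_BYTES, HOT_FRACTION_PERCENT)
print(f"{size / 10**9:.1f} GB")
```

Under that capacity assumption the result matches the 1.6 GB range reported by the drive vendor, so the two figures describe the same small region of the drive.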
 
Symptoms may include:
  • Degraded drive channels as a result of drive side timeouts
  • Write timeouts on multiple drives (IOP_FAST_TIMEOUT_ERROR)
  • PI errors
  • Unreadable sectors reported (URS/data loss)

The regular troubleshooting steps detailed in the E-Series degraded drive channel and multiple individual drive degraded path KB do not resolve the issue.

 
The issue occurs in different shelves, drawers, and drive bays, and no common failing component can be identified in the chain.
Reseating all drives and snake cables (or the other troubleshooting steps from the above KB) brings no improvement.
Drives are less than a year old (well below the 5-year age limit), and replacement drives in the same slot show the same symptoms or fail as well.
 
Major event logs will show events similar to the following:
 
A:11/30/21, 3:31:03 AM (03:31:03) 2206 1209 Drive channel set to Degraded - Drive-side: channel 3 <--CRITICAL
A:11/30/21, 3:31:03 AM (03:31:03) 2205 1513 Individual drive - Degraded path - Drive-side: channel 3 <--CRITICAL
A:11/30/21, 3:30:55 AM (03:30:55) 2204 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:46 AM (03:30:46) 2203 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x189fae400, Blocks: 0x400 - Recovered
----> Flags: 0x40202001 = READ: Read Operation, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:30:43 AM (03:30:43) 2202 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0xb49c7358, Blocks: 0x8 - Recovered
----> Flags: 0x40202001 = READ: Read Operation, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:30:43 AM (03:30:43) 2201 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:06 AM (03:30:06) 2200 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:49 AM (03:29:49) 2199 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:41 AM (03:29:41) 2198 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x21639800, Blocks: 0x400 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:29:39 AM (03:29:39) 2197 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:38 AM (03:29:38) 2196 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x1538dcec0, Blocks: 0x10 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:29:35 AM (03:29:35) 2195 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x1266587a0, Blocks: 0x8 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
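
To see whether timeouts cluster on one drive or are spread across shelves, drawers, and bays, the events can be tallied by location. The following is a minimal sketch, assuming the log has been saved as plain text in the format of the excerpts in this article; the line format may differ between releases, and `count_timeouts` is a hypothetical helper, not a NetApp tool.

```python
import re
from collections import Counter

# Pattern for "Timeout on drive side of controller" lines as they appear in
# the excerpts; this is an assumption about the saved text format.
TIMEOUT_RE = re.compile(
    r"Timeout on drive side of controller - "
    r"Shelf (\d+), Drawer (\d+), Bay (\d+)"
)


def count_timeouts(log_text: str) -> Counter:
    """Count drive-side timeout events per (shelf, drawer, bay)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            location = tuple(int(group) for group in match.groups())
            counts[location] += 1
    return counts


sample = """\
A:11/30/21, 3:30:55 AM (03:30:55) 2204 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:43 AM (03:30:43) 2201 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:12/25/21, 7:58:16 AM (07:58:16) 47154 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5
"""

# Print locations from most to least affected.
for location, count in count_timeouts(sample).most_common():
    print(location, count)
```

A tally spread over many unrelated locations is consistent with the observation that no single component in the chain is at fault.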
 
A:12/31/21, 9:31:45 AM (09:31:45) 52721 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239814b <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x84047314b
A:12/31/21, 9:31:44 AM (09:31:44) 52720 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239814a <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x84047314a
A:12/31/21, 9:31:42 AM (09:31:42) 52719 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c2398149 <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x840473149
A:12/31/21, 9:31:41 AM (09:31:41) 52718 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c2398148 <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x840473148
A:12/31/21, 9:31:41 AM (09:31:41) 52717 201e VDD repair started - Shelf 30, Bay A - SSID: 33, Devnum: 0xffffff
A:12/31/21, 9:31:41 AM (09:31:41) 52716 201f VDD repair completed - Shelf 30, Bay A - SSID: 33, Devnum: 0x010217 LBA: 0x12c2399800
----> Flags: 0x202005 = READ: Read Operation, ERROR: IO Compl. w. Err, NOLOCK: Prevent lock during read err., PI: Error coding in effect - Error: 0x844 = UA_MISCORRECTED_DATA_ERROR
A:12/31/21, 9:31:40 AM (09:31:40) 52715 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239994f <--CRITICAL
----> Physical Drive in Tray 32 Slot 24, LBA: 0x3eed7314f
A:12/31/21, 9:31:40 AM (09:31:40) 52714 1012 Destination driver error - Shelf 32, Drawer 2, Bay 11
A:12/31/21, 9:31:40 AM (09:31:40) 52713 1016 Drive returned unrecoverable media error - Shelf 32, Drawer 2, Bay 11
----> Sense 3/11/0 = Medium Error - Unrecovered read error - CDB: 0x7f(0x9) = Read(32) - LBA: ~0x3eed7314f
A:12/31/21, 9:31:37 AM (09:31:37) 52712 1016 Drive returned unrecoverable media error - Shelf 32, Drawer 2, Bay 11
----> Sense 3/11/0 = Medium Error - Unrecovered read error - CDB: 0x7f(0x9) = Read(32) - LBA: ~0x3eed7314f
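
The `Sense 3/11/0` field in the lines above follows the usual SCSI sense key / ASC / ASCQ layout. The sketch below decodes such a triplet, assuming each field is printed in hexadecimal and in that order; the lookup tables hold only the codes seen in this excerpt, and `decode_sense` is an illustrative name.

```python
# Small excerpt of the standard SCSI tables; only the codes needed here.
SENSE_KEYS = {0x3: "Medium Error"}
ADDITIONAL_SENSE = {(0x11, 0x00): "Unrecovered read error"}


def decode_sense(triplet: str) -> str:
    """Decode a 'key/asc/ascq' string, treating each field as hexadecimal."""
    key, asc, ascq = (int(part, 16) for part in triplet.split("/"))
    # Fall back to the raw codes when a value is not in the small tables.
    key_text = SENSE_KEYS.get(key, f"sense key 0x{key:x}")
    asc_text = ADDITIONAL_SENSE.get(
        (asc, ascq), f"ASC/ASCQ 0x{asc:02x}/0x{ascq:02x}"
    )
    return f"{key_text} - {asc_text}"


print(decode_sense("3/11/0"))
```

The decoded text agrees with the translation the log already prints next to the raw codes.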
 
A:12/25/21, 7:58:16 AM (07:58:16) 47154 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47153 2215 Drive marked failed - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47152 226c Drive failure - Shelf 33, Drawer 4, Bay 5 - Cause: 3 = Write failure; Drive WWN: 5000c500cadc69b7; SN: ZL29F9KB0000C107BKS5 <--CRITICAL
B:12/25/21, 7:58:40 AM (07:58:40) 47151 2226 Drive spun down - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47150 7e05 Drive recovery criteria not met - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:39 AM (07:58:39) 47149 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.