NetApp Knowledge Base

The failure rate for disks with odd numbers in the same shelf is high

Category:
fas-systems
Specialty:
hw

Applies to

  • A400
  • DS224-12
  • IOM12B
  • ONTAP 9.13.1P9

Issue

  • Disks in odd-numbered bays of the same shelf fail at a high rate
  • The affected shelf is shelf ID 10
  • Before the issue occurred, the customer had expanded the existing shelf stack by adding shelf ID 11
Example
Sat Oct 26 18:43:42 +0900 [node2: disk_server_0: shm.threshold.consecutiveTimeouts:error]: shm: Disk 0a.10.11 has exceeded the threshold of 11 consecutive timeouts; the system will fail the disk if possible.
Sat Oct 26 18:43:46 +0900 [node2: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.11 Shelf 10 Bay 11 [NETAPP   X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DED66C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
 
Sat Oct 26 18:47:30 +0900 [node2: disk_server_1: shm.threshold.consecutiveTimeouts:error]: shm: Disk 0a.10.1 has exceeded the threshold of 11 consecutive timeouts; the system will fail the disk if possible.
Sat Oct 26 18:47:31 +0900 [node2: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.1 Shelf 10 Bay 1 [NETAPP   X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DED670:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
 
Sat Nov 09 19:13:17 +0900 [node2: disk_server_1: shm.threshold.consecutiveTimeouts:error]: shm: Disk 0a.10.9 has exceeded the threshold of 11 consecutive timeouts; the system will fail the disk if possible.
Sat Nov 09 19:13:18 +0900 [node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 4d.10.9 Shelf 10 Bay 9 [NETAPP   X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DED654:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
 
Sat Nov 09 19:25:35 +0900 [node1: disk_admin: disk.outOfService:notice]: Drive 4d.10.13 (xxxxxxxxxxxx): exceeded consecutive timeout threshold.
Sat Nov 09 19:25:35 +0900 [node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 4d.10.13 Shelf 10 Bay 13 [NETAPP   X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DF0000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
 
Sun Nov 10 01:01:29 +0900 [node1: hamsg: disk.outOfService:notice]: Drive 0a.10.21 (xxxxxxxxxxxx): exceeded consecutive timeout threshold
Sun Nov 10 01:01:29 +0900 [node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.21 Shelf 10 Bay 21 [NETAPP   X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DED7C4:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
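The odd-bay pattern in the events above can be checked mechanically. The following is a minimal Python sketch, not a NetApp tool: it assumes the EMS messages have been saved as plain text in the format shown above, and the names `failed_bays` and `parity_counts` as well as the regular expression are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed pattern for raid.config.filesystem.disk.failed events as they
# appear in the example above; other ONTAP releases may format them differently.
FAILED_RE = re.compile(
    r"raid\.config\.filesystem\.disk\.failed:\w+\]: .*?"
    r"Disk (?P<disk>\S+) Shelf (?P<shelf>\d+) Bay (?P<bay>\d+)"
)

def failed_bays(ems_text):
    """Return a list of (shelf, bay) tuples, one per disk-failed event."""
    return [
        (int(m.group("shelf")), int(m.group("bay")))
        for m in FAILED_RE.finditer(ems_text)
    ]

def parity_counts(bays):
    """Count failed disks per (shelf, 'odd' or 'even') bay position."""
    counts = Counter()
    for shelf, bay in bays:
        counts[(shelf, "odd" if bay % 2 else "even")] += 1
    return counts

# The five disk-failed events from the example, with S/N and UID trimmed.
SAMPLE = """\
Sat Oct 26 18:43:46 +0900 [node2: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.11 Shelf 10 Bay 11 [NETAPP   X357_KPM6V3T8ATE NA51] failed.
Sat Oct 26 18:47:31 +0900 [node2: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.1 Shelf 10 Bay 1 [NETAPP   X357_KPM6V3T8ATE NA51] failed.
Sat Nov 09 19:13:18 +0900 [node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 4d.10.9 Shelf 10 Bay 9 [NETAPP   X357_KPM6V3T8ATE NA51] failed.
Sat Nov 09 19:25:35 +0900 [node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 4d.10.13 Shelf 10 Bay 13 [NETAPP   X357_KPM6V3T8ATE NA51] failed.
Sun Nov 10 01:01:29 +0900 [node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.21 Shelf 10 Bay 21 [NETAPP   X357_KPM6V3T8ATE NA51] failed.
"""

if __name__ == "__main__":
    print(parity_counts(failed_bays(SAMPLE)))
```

For the five events in the example, the sketch reports five failures in odd bays of shelf 10 and none in even bays.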
  • According to the sas-expander-map output, most of the unstable disk I/O passes through the same module: the IOM12B in slot B of shelf 10
node1
 
sas-expander-map
 
Expanders on channel 0a: 
Level    1: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12    ', Rev '0141', Slot A
Level    2: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A   ', Rev '0311', Slot A
Expanders on channel 0b: 
Expanders on channel 0c: 
Expanders on channel 0d: 
Expanders on channel 4a: 
Expanders on channel 4b: 
Expanders on channel 4c: 
Expanders on channel 4d: 
Level    1: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A   ', Rev '0311', Slot B
Level    2: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12    ', Rev '0141', Slot B
 
EMS
Mon Nov 11 23:49:51 +0900 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 4d.10.7: Command aborted by host adapter: HA status 0x4: cdb 0x28:974f4160:0008. 
Mon Nov 11 23:49:51 +0900 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 4d.10.7: Command aborted by host adapter: HA status 0x4: cdb 0x28:963ca878:0008. 
Mon Nov 11 23:49:51 +0900 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 4d.10.7: request successful after retry #1/#0: cdb 0x28:974f4160:0008 (8585).
Mon Nov 11 23:49:51 +0900 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 4d.10.7: request successful after retry #1/#0: cdb 0x28:963ca878:0008 (8585).
 
node2
 
sas-expander-map
Expanders on channel 0a: 
Level    1: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12    ', Rev '0141', Slot B
Level    2: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A   ', Rev '0311', Slot B
Expanders on channel 0b: 
Expanders on channel 0c: 
Expanders on channel 0d: 
Expanders on channel 4a: 
Expanders on channel 4b: 
Expanders on channel 4c: 
Expanders on channel 4d: 
Level    1: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A   ', Rev '0311', Slot A
Level    2: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12    ', Rev '0141', Slot A
 
EMS
Tue Nov 12 00:15:40 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0a.10.23: Command aborted by host adapter: HA status 0x4: cdb 0x88:00000001628f7640:00000200. 
Tue Nov 12 00:15:40 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0a.10.23: request successful after retry #1/#0: cdb 0x88:00000001628f7640:00000200 (8896).
Tue Nov 12 00:16:05 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0a.10.3: Command aborted by host adapter: HA status 0x4: cdb 0x28:f54c71b8:0100. 
Tue Nov 12 00:16:05 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0a.10.3: request successful after retry #1/#0: cdb 0x28:f54c71b8:0100 (8744).
Tue Nov 12 00:16:17 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0a.10.5: Command aborted by host adapter: HA status 0x4: cdb 0x88:0000000177f2f0d0:00000110. 
Tue Nov 12 00:16:17 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0a.10.5: request successful after retry #1/#0: cdb 0x88:0000000177f2f0d0:00000110 (8529).
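The correlation between aborted commands and the IOM slot can be reproduced from the two outputs above. The following is a minimal Python sketch under stated assumptions: it treats the prefix of a device name (for example `4d` in `4d.10.7`) as the channel listed in the sas-expander-map output, and the function names and regular expressions are illustrative, not part of ONTAP.

```python
import re

# Assumed line formats, taken from the outputs shown above.
CHANNEL_RE = re.compile(r"^Expanders on channel (\w+):")
LEVEL_RE = re.compile(r"^Level\s+(\d+): .*?\bID (\d+),.*?Slot (\w)\s*$")
ABORT_RE = re.compile(
    r"scsi\.cmd\.abortedByHost:\w+\]: Disk device (\w+)\.(\d+)\.(\d+):"
)

def parse_expander_map(text):
    """Return {(channel, shelf_id): slot} from sas-expander-map output."""
    slots = {}
    channel = None
    for line in text.splitlines():
        line = line.strip()
        m = CHANNEL_RE.match(line)
        if m:
            channel = m.group(1)
            continue
        m = LEVEL_RE.match(line)
        if m and channel is not None:
            slots[(channel, int(m.group(2)))] = m.group(3)
    return slots

def slots_for_aborts(expander_map_text, ems_text):
    """Map each device with an aborted command to the IOM slot on its path."""
    slots = parse_expander_map(expander_map_text)
    result = {}
    for channel, shelf, bay in ABORT_RE.findall(ems_text):
        device = f"{channel}.{shelf}.{bay}"
        result[device] = slots.get((channel, int(shelf)), "unknown")
    return result

# node1 output from above (channels without expanders omitted).
NODE1_MAP = """\
Expanders on channel 0a:
Level    1: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12    ', Rev '0141', Slot A
Level    2: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A   ', Rev '0311', Slot A
Expanders on channel 4d:
Level    1: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A   ', Rev '0311', Slot B
Level    2: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12    ', Rev '0141', Slot B
"""
NODE1_EMS = (
    "Mon Nov 11 23:49:51 +0900 [node1: scsi_cmdblk_strthr_admin: "
    "scsi.cmd.abortedByHost:error]: Disk device 4d.10.7: Command aborted "
    "by host adapter: HA status 0x4: cdb 0x28:974f4160:0008."
)

if __name__ == "__main__":
    print(slots_for_aborts(NODE1_MAP, NODE1_EMS))
```

With the node1 data, device 4d.10.7 resolves to slot B of shelf 10. Applying the same lookup to the node2 output resolves 0a.10.23, 0a.10.3, and 0a.10.5 to slot B as well, since channel 0a on node2 reaches shelf 10 through slot B.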

