The failure rate for disks with odd numbers in the same shelf is high
Applies to
- A400
- DS224-12
- IOM12B
- ONTAP 9.13.1P9
Issue
- The failure rate for disks in odd-numbered bays of the same shelf is high
- Shelf ID 10
- Before this issue occurred, the customer expanded the existing shelf stack by adding shelf ID 11.
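Device names in the EMS events have the form channel.shelf.bay (for example, Disk 0a.10.11 is logged as Shelf 10 Bay 11), so the odd-bay pattern can be checked mechanically. The following is a minimal Python sketch, not a NetApp tool: the helper names `parse_device` and `bay_parity` are hypothetical, and the `FAILED` list is copied by hand from the raid.config.filesystem.disk.failed events in the Example.

```python
from collections import Counter

def parse_device(name):
    """Split a device name such as '0a.10.11' into (channel, shelf, bay)."""
    channel, shelf, bay = name.split(".")
    return channel, int(shelf), int(bay)

def bay_parity(names):
    """Count odd and even bays, counting each (shelf, bay) pair only once."""
    seen = {parse_device(name)[1:] for name in names}
    return Counter("odd" if bay % 2 else "even" for _, bay in seen)

# Devices named in the raid.config.filesystem.disk.failed events of this article.
FAILED = ["0a.10.11", "0a.10.1", "4d.10.9", "4d.10.13", "0a.10.21"]

print(bay_parity(FAILED))
```

For the five failed disks in this case, every bay falls in the odd group.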
Example
Sat Oct 26 18:43:42 +0900 [node2: disk_server_0: shm.threshold.consecutiveTimeouts:error]: shm: Disk 0a.10.11 has exceeded the threshold of 11 consecutive timeouts; the system will fail the disk if possible.
Sat Oct 26 18:43:46 +0900 [node2: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.11 Shelf 10 Bay 11 [NETAPP X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DED66C:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
Sat Oct 26 18:47:30 +0900 [node2: disk_server_1: shm.threshold.consecutiveTimeouts:error]: shm: Disk 0a.10.1 has exceeded the threshold of 11 consecutive timeouts; the system will fail the disk if possible.
Sat Oct 26 18:47:31 +0900 [node2: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.1 Shelf 10 Bay 1 [NETAPP X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DED670:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
Sat Nov 09 19:13:17 +0900 [node2: disk_server_1: shm.threshold.consecutiveTimeouts:error]: shm: Disk 0a.10.9 has exceeded the threshold of 11 consecutive timeouts; the system will fail the disk if possible.
Sat Nov 09 19:13:18 +0900 [node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 4d.10.9 Shelf 10 Bay 9 [NETAPP X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DED654:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
Sat Nov 09 19:25:35 +0900 [node1: disk_admin: disk.outOfService:notice]: Drive 4d.10.13 (xxxxxxxxxxxx): exceeded consecutive timeout threshold.
Sat Nov 09 19:25:35 +0900 [node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 4d.10.13 Shelf 10 Bay 13 [NETAPP X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DF0000:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
Sun Nov 10 01:01:29 +0900 [node1: hamsg: disk.outOfService:notice]: Drive 0a.10.21 (xxxxxxxxxxxx): exceeded consecutive timeout threshold
Sun Nov 10 01:01:29 +0900 [node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.21 Shelf 10 Bay 21 [NETAPP X357_KPM6V3T8ATE NA51] S/N [xxxxxxxxxxxx] UID [58CE38EE:22DED7C4:00000000:00000000:00000000:00000000:00000000:00000000:00000000:00000000] failed.
- According to the sas-expander-map output, most of the unstable disk I/O is processed through the same IOM12B module: module B on shelf 10
node1
sas-expander-map
Expanders on channel 0a:
  Level 1: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12 ', Rev '0141', Slot A
  Level 2: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A ', Rev '0311', Slot A
Expanders on channel 0b:
Expanders on channel 0c:
Expanders on channel 0d:
Expanders on channel 4a:
Expanders on channel 4b:
Expanders on channel 4c:
Expanders on channel 4d:
  Level 1: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A ', Rev '0311', Slot B
  Level 2: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12 ', Rev '0141', Slot B
EMS
Mon Nov 11 23:49:51 +0900 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 4d.10.7: Command aborted by host adapter: HA status 0x4: cdb 0x28:974f4160:0008.
Mon Nov 11 23:49:51 +0900 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 4d.10.7: Command aborted by host adapter: HA status 0x4: cdb 0x28:963ca878:0008.
Mon Nov 11 23:49:51 +0900 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 4d.10.7: request successful after retry #1/#0: cdb 0x28:974f4160:0008 (8585).
Mon Nov 11 23:49:51 +0900 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 4d.10.7: request successful after retry #1/#0: cdb 0x28:963ca878:0008 (8585).
node2
sas-expander-map
Expanders on channel 0a:
  Level 1: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12 ', Rev '0141', Slot B
  Level 2: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A ', Rev '0311', Slot B
Expanders on channel 0b:
Expanders on channel 0c:
Expanders on channel 0d:
Expanders on channel 4a:
Expanders on channel 4b:
Expanders on channel 4c:
Expanders on channel 4d:
  Level 1: WWN 500a098008xxxxxx, ID 11, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12A ', Rev '0311', Slot A
  Level 2: WWN 500a098008xxxxxx, ID 10, Serial Number ' SHxxxxxxxxxxxxx', Product 'DS22412IOM12 ', Rev '0141', Slot A
EMS
Tue Nov 12 00:15:40 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0a.10.23: Command aborted by host adapter: HA status 0x4: cdb 0x88:00000001628f7640:00000200.
Tue Nov 12 00:15:40 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0a.10.23: request successful after retry #1/#0: cdb 0x88:00000001628f7640:00000200 (8896).
Tue Nov 12 00:16:05 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0a.10.3: Command aborted by host adapter: HA status 0x4: cdb 0x28:f54c71b8:0100.
Tue Nov 12 00:16:05 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0a.10.3: request successful after retry #1/#0: cdb 0x28:f54c71b8:0100 (8744).
Tue Nov 12 00:16:17 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0a.10.5: Command aborted by host adapter: HA status 0x4: cdb 0x88:0000000177f2f0d0:00000110.
Tue Nov 12 00:16:17 +0900 [node2: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0a.10.5: request successful after retry #1/#0: cdb 0x88:0000000177f2f0d0:00000110 (8529).
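
The correlation between the events and the IOM12B module in slot B can be reproduced by combining the reporting node and channel of each event with the sas-expander-map output. The following is a minimal Python sketch under stated assumptions: the `SHELF10_SLOT` table is transcribed by hand from the output above, the helper names are hypothetical, the event strings are trimmed excerpts of the EMS lines in this article (timestamps and trailing fields removed), and each event is assumed to have travelled over the channel named in its device name.

```python
import re
from collections import Counter

# Assumption: transcribed by hand from the sas-expander-map output above.
# Maps (node, channel) to the IOM slot that channel reaches on shelf 10.
SHELF10_SLOT = {
    ("node1", "0a"): "A",
    ("node1", "4d"): "B",
    ("node2", "0a"): "B",
    ("node2", "4d"): "A",
}

# Captures the reporting node and the channel of a shelf 10 device name.
EVENT_RE = re.compile(r"\[(?P<node>\w+):.*?\b(?P<channel>[0-9a-f]{2})\.10\.\d+\b")

def path_slot(line):
    """Return the shelf 10 IOM slot for an EMS line, or None if unknown."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    return SHELF10_SLOT.get((match["node"], match["channel"]))

def slot_counts(lines):
    """Count events per shelf 10 IOM slot."""
    return Counter(path_slot(line) for line in lines)

# Trimmed excerpts of the disk failure and command abort events in this article.
EVENTS = [
    "[node2: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.11",
    "[node2: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.1",
    "[node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 4d.10.9",
    "[node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 4d.10.13",
    "[node1: config_thread: raid.config.filesystem.disk.failed:error]: File system Disk 0a.10.21",
    "[node1: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 4d.10.7:",
    "[node2: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0a.10.23:",
    "[node2: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0a.10.3:",
    "[node2: scsi_cmdblk_strthr_admin: scsi.cmd.abortedByHost:error]: Disk device 0a.10.5:",
]

print(slot_counts(EVENTS))
```

With these excerpts, eight of the nine events map to slot B and one (Disk 0a.10.21 on node1) maps to slot A, which matches the statement that most, but not all, of the unstable I/O passes through module B on shelf 10.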