Sick disk causes performance impact
Applies to
- ONTAP 9
- ONTAP 8
- FAS
- AFF
- Does not apply to a single disk that has already failed (ONTAP will fail a drive based on a threshold of errors and latency)
- Applies to disk(s) that have not failed
Issue
- High FlexVol latency observed.
- High latency may lead to NFS disconnections in some scenarios.
- Running the qos statistics volume latency show command shows the primary delay under the disk column.
- A single drive exhibits significantly higher utilization and latency than the other drives in its RAID group; this can be validated with the node shell statit command:
cluster1::> node run -node local -command "priv set -q advanced; statit -e"
...
disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr1/plex0/rg0:
0a.10.10 31 93.15 0.00 .... . 54.89 26.94 590 38.26 38.85 155 0.00 .... . 0.00 .... .
0a.10.1 33 93.98 0.00 .... . 55.75 26.55 630 38.23 38.83 183 0.00 .... . 0.00 .... .
0a.10.2 19 118.78 9.53 3.50 8515 56.77 10.57 291 52.49 9.60 543 0.00 .... . 0.00 .... .
0a.10.3 21 120.65 10.11 3.80 8440 58.10 10.88 362 52.43 9.50 566 0.00 .... . 0.00 .... .
0a.10.4 20 119.76 9.21 3.27 9108 57.79 10.54 314 52.76 9.44 552 0.00 .... . 0.00 .... .
0a.10.5 100 121.62 10.52 3.22 19375 58.78 10.20 7699 52.32 9.79 4831 0.00 .... . 0.00 .... .
0a.10.6 18 119.96 9.57 3.33 8727 57.97 10.73 216 52.42 9.64 541 0.00 .... . 0.00 .... .
0a.10.7 18 119.06 9.01 3.53 8786 57.71 10.57 223 52.34 9.56 535 0.00 .... . 0.00 .... .
0a.10.8 18 121.28 9.75 3.76 8179 59.29 10.89 235 52.24 9.72 544 0.00 .... . 0.00 .... .
0a.10.9 19 121.30 10.90 3.47 8249 58.15 11.07 217 52.26 9.87 526 0.00 .... . 0.00 .... .
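In the statit output above, drive 0a.10.5 stands out: near 100% ut% and read-chain latency (usecs) several times higher than its RAID-group peers. That comparison can be done programmatically. The following is a minimal sketch, not a NetApp tool; the per-disk values are copied from the example output (parity drives omitted, since their I/O mix differs), and the threshold factors are illustrative assumptions:

```python
# Flag a "sick" disk: utilization and read latency far above the
# median of the other data drives in the same RAID group.
from statistics import median

# disk -> (ut%, ureads usecs); values from the statit example above
rg0 = {
    "0a.10.2": (19, 8515),
    "0a.10.3": (21, 8440),
    "0a.10.4": (20, 9108),
    "0a.10.5": (100, 19375),   # the suspect drive
    "0a.10.6": (18, 8727),
    "0a.10.7": (18, 8786),
    "0a.10.8": (18, 8179),
    "0a.10.9": (19, 8249),
}

def sick_disks(rg, ut_factor=3.0, usec_factor=2.0):
    """Return disks whose ut% and read latency both exceed the
    RAID-group median by the given (assumed) factors."""
    med_ut = median(ut for ut, _ in rg.values())
    med_us = median(us for _, us in rg.values())
    return [d for d, (ut, us) in rg.items()
            if ut > ut_factor * med_ut and us > usec_factor * med_us]

print(sick_disks(rg0))  # ['0a.10.5']
```

A single outlier like this, with the rest of the RAID group healthy, is the signature of a sick (not yet failed) disk rather than an overloaded aggregate.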
- EMS logs may report several errors and aborts on the disk prior to marking it as failed:
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0: cdb 0x28:3b468100:0008 (24080).
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0: cdb 0x28:3b4681a8:0008 (24081).
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0: cdb 0x88:000000020ab11b00:00000008 (24928).
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0: cdb 0x88:00000002b7ff0d00:00000038 (24619).
config_thread: raid.disk.delete.drl:debug]: aggregate Disk /aggr01_node02/plex0/rg0/3b.51.1L1 Shelf 51 Bay 1 [NETAPP X481_SMKRE06TSDB NA03] S/N [S4D12BT0] UID [5000C500:8CE40C44:00000000:00000000:00000000:00000000:00000000: 00000000:00000000:00000000] Deleting dirty region log DRL_1.
wafl.cp.toolong:error]: Aggregate fas_01_DATA_AGGR experienced a long CP...
- EMS may also report the message: wafl.cp.toolong:error
wafl_exempt08: wafl.cp.toolong:error]: Aggregate fas_01_DATA_AGGR experienced a long CP.
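When the retrySuccess messages are numerous, it helps to confirm they cluster on a single device rather than being spread across the shelf. A rough triage sketch (not an official NetApp utility) that counts scsi.cmd.retrySuccess events per disk device from saved EMS text:

```python
import re
from collections import Counter

# EMS excerpts like those above (abbreviated sample for illustration).
ems = """\
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0: cdb 0x28:3b468100:0008 (24080).
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0: cdb 0x88:000000020ab11b00:00000008 (24928).
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0: cdb 0x88:00000002b7ff0d00:00000038 (24619).
"""

# Count scsi.cmd.retrySuccess events per disk device.
pattern = re.compile(r"scsi\.cmd\.retrySuccess.*?Disk device (\S+):")
retries = Counter(pattern.findall(ems))
print(retries.most_common())  # [('3b.51.1L2', 3)]
```

If one device dominates the counts, as 3b.51.1L2 does in the excerpt, that drive is the likely cause of the long CPs and elevated disk-center latency.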