Sick disk causes performance impact
Applies to
- Drives that have not yet been failed by ONTAP
- Does not apply to a drive that has already failed
- ONTAP will fail a drive based on a threshold of errors and latency
Issue
- High latency is observed on a volume (FlexVol).
- In some scenarios, the high latency may lead to NFS disconnections.
- Running the qos statistics volume latency show command shows the primary delay under the Disk column. Example:
::> qos statistics volume latency show -vserver SVM_name -volume vol_name
Workload ID Latency Network Cluster Data Disk QoS Max QoS Min NVRAM ...
--------------- ------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ...
workload_name 12345 154.92ms 294.00us 0ms 1115.00us 153.36ms 0ms 0ms 157.00us ...
workload_name 12345 117.39ms 376.00us 0ms 1.59ms 115.27ms 0ms 0ms 157.00us ...
workload_name 12345 110.26ms 391.00us 0ms 1.86ms 107.86ms 0ms 0ms 139.00us ...
...
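- A single drive exhibits significantly higher utilization and latency than the other drives in its RAID group. In the example below, disk 0a.10.5 is at 100% utilization while the other drives are between 18% and 33%. Example: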
::> system node run -node node_name -command "priv set -q advanced; statit -e"
...
disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs ...
/aggr1/plex0/rg0:
0a.10.10 31 93.15 0.00 .... . 54.89 26.94 590 38.26 38.85 155 0.00 .... . ...
0a.10.1 33 93.98 0.00 .... . 55.75 26.55 630 38.23 38.83 183 0.00 .... . ...
0a.10.2 19 118.78 9.53 3.50 8515 56.77 10.57 291 52.49 9.60 543 0.00 .... . ...
0a.10.3 21 120.65 10.11 3.80 8440 58.10 10.88 362 52.43 9.50 566 0.00 .... . ...
0a.10.4 20 119.76 9.21 3.27 9108 57.79 10.54 314 52.76 9.44 552 0.00 .... . ...
0a.10.5 100 121.62 10.52 3.22 19375 58.78 10.20 7699 52.32 9.79 4831 0.00 .... . ...
0a.10.6 18 119.96 9.57 3.33 8727 57.97 10.73 216 52.42 9.64 541 0.00 .... . ...
0a.10.7 18 119.06 9.01 3.53 8786 57.71 10.57 223 52.34 9.56 535 0.00 .... . ...
0a.10.8 18 121.28 9.75 3.76 8179 59.29 10.89 235 52.24 9.72 544 0.00 .... . ...
...
- ONTAP events (EMS Logs) may report:
- Several errors and aborts on the drive before it is marked as failed. Example:
... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...
... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...
... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...
... config_thread: raid.disk.delete.drl:debug]: aggregate Disk /aggr_name/plex0/rg0/ [...] Deleting dirty region log ...
- A "long" consistency point (CP) in an aggregate. Example:
wafl_exempt08: wafl.cp.toolong:error]: Aggregate aggr_name experienced a long CP.
- Storage Health Monitor IO latency (shm.threshold.ioLatency). Example:
[Cluster-01: disk_latency_monitor: shm.threshold.ioLatency:debug]: Disk XX.XX.XX has exceeded the expected IO latency in the current window with average latency of 50 msecs and average utilization of 100 percent. Highest average IO latency: XX.XX.: 50 msecs; next highest IO latency: XX.XX.XX: 6 msecs. Disk XX.XX.XX Shelf X Drawer X Slot X Bay XX [NETAPP X375_TTCRE04TA07 NA03] S/N [#########]
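
The per-disk comparison in the statit example above (one drive at 100% utilization while its RAID group peers sit far lower) can be sketched in Python. This is a minimal illustration, not an ONTAP tool: the function name find_outlier_disks and the ratio and floor thresholds are assumptions chosen for this example, and they do not reflect the thresholds ONTAP uses to fail a drive. The ut% values are copied from the statit output above.

```python
from statistics import median

# ut% values copied from the statit example above (RAID group /aggr1/plex0/rg0).
UTILIZATION = {
    "0a.10.10": 31, "0a.10.1": 33, "0a.10.2": 19,
    "0a.10.3": 21, "0a.10.4": 20, "0a.10.5": 100,
    "0a.10.6": 18, "0a.10.7": 18, "0a.10.8": 18,
}


def find_outlier_disks(utilization, ratio=3.0, floor=50):
    """Return disks whose ut% stands out from the rest of the RAID group.

    A disk is reported when its utilization is at least `floor` percent AND
    at least `ratio` times the group median. Both thresholds are illustrative
    assumptions, not values used by ONTAP.
    """
    group_median = median(utilization.values())
    return [
        disk
        for disk, ut in utilization.items()
        if ut >= floor and ut >= ratio * group_median
    ]


print(find_outlier_disks(UTILIZATION))  # ['0a.10.5']
```

Comparing against the group median rather than the mean keeps a single saturated drive from pulling the baseline up. The same check could be applied to the chain-usecs latency columns, with thresholds adjusted for the drive type.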