NetApp Knowledge Base

Sick disk causes performance impact

Category:
ontap-9
Specialty:
perf
Applies to

  • Drives that have not failed
    • Does not apply to a drive that has already failed
    • ONTAP fails a drive once it crosses a threshold of errors and latency

Issue

  • High volume (FlexVol) latency observed.
    • In some scenarios, high latency may lead to NFS disconnections
  • The qos statistics volume latency show command shows most of the latency in the Disk column. Example:

::> qos statistics volume latency show -vserver SVM_name -volume vol_name
Workload            ID    Latency    Network    Cluster       Data       Disk    QoS Max    QoS Min      NVRAM ...
--------------- ------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ...
workload_name    12345   154.92ms   294.00us        0ms  1115.00us   153.36ms        0ms        0ms   157.00us ...
workload_name    12345   117.39ms   376.00us        0ms     1.59ms   115.27ms        0ms        0ms   157.00us ...
workload_name    12345   110.26ms   391.00us        0ms     1.86ms   107.86ms        0ms        0ms   139.00us ...
...
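
To judge how much of the total latency comes from the Disk column, the rows can be parsed with a short script. The following is a minimal sketch, not a NetApp tool: the function names to_ms and disk_share are hypothetical, and it assumes the column order and the us/ms unit suffixes shown in the example above.

```python
import re

# Unit suffixes seen in the example output; values are converted to milliseconds.
_UNITS_MS = {"us": 0.001, "ms": 1.0}


def to_ms(value: str) -> float:
    """Convert a latency string such as '294.00us' or '0ms' to milliseconds."""
    match = re.fullmatch(r"([0-9.]+)(us|ms)", value)
    if match is None:
        raise ValueError(f"unrecognized latency value: {value!r}")
    return float(match.group(1)) * _UNITS_MS[match.group(2)]


def disk_share(row: str) -> float:
    """Return the fraction of total latency attributed to the Disk column.

    Assumes the column order from the example:
    Workload, ID, Latency, Network, Cluster, Data, Disk, ...
    """
    fields = row.split()
    total = to_ms(fields[2])  # Latency column
    disk = to_ms(fields[6])   # Disk column
    return disk / total if total else 0.0


# First data row of the example output above.
sample = (
    "workload_name    12345   154.92ms   294.00us        0ms  "
    "1115.00us   153.36ms        0ms        0ms   157.00us ..."
)
print(f"Disk share of latency: {disk_share(sample):.1%}")
```

A share close to 100% over several consecutive rows indicates that the delay is primarily in the disk layer.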

  • A single drive exhibits significantly higher utilization and latency than the other drives in its RAID group. Example:

::> system node run -node node_name -command "priv set -q advanced; statit -e"
...
disk             ut%  xfers  ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs   ...
/aggr1/plex0/rg0:
0a.10.10          31  93.15    0.00   ....     .  54.89  26.94   590  38.26  38.85   155   0.00   ....     .   ...
0a.10.1           33  93.98    0.00   ....     .  55.75  26.55   630  38.23  38.83   183   0.00   ....     .   ...
0a.10.2           19 118.78    9.53   3.50  8515  56.77  10.57   291  52.49   9.60   543   0.00   ....     .   ...
0a.10.3           21 120.65   10.11   3.80  8440  58.10  10.88   362  52.43   9.50   566   0.00   ....     .  ...
0a.10.4           20 119.76    9.21   3.27  9108  57.79  10.54   314  52.76   9.44   552   0.00   ....     .  ...
0a.10.5          100 121.62   10.52   3.22 19375  58.78  10.20  7699  52.32   9.79  4831   0.00   ....     .  ...
0a.10.6           18 119.96    9.57   3.33  8727  57.97  10.73   216  52.42   9.64   541   0.00   ....     .  ...
0a.10.7           18 119.06    9.01   3.53  8786  57.71  10.57   223  52.34   9.56   535   0.00   ....     .  ...
0a.10.8           18 121.28    9.75   3.76  8179  59.29  10.89   235  52.24   9.72   544   0.00   ....     .  ...
...
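
In a long statit listing, the outlier drive can be picked out by comparing each drive's ut% against the median of its RAID group. The sketch below illustrates one way to do that. The function name busy_outliers and the factor of 2 are assumptions made for this example and are not ONTAP thresholds. The sample rows are taken from the output above, trimmed to the first three columns.

```python
from statistics import median


def busy_outliers(rows, factor=2.0):
    """Return (disk, ut%) pairs whose utilization exceeds factor x the group median.

    `factor` is an illustrative threshold, not an ONTAP setting.
    """
    util = []
    for row in rows:
        fields = row.split()
        # Disk rows start with the disk name followed by an integer ut% value;
        # skip the header, RAID group labels, and elisions ("...").
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        util.append((fields[0], int(fields[1])))
    if not util:
        return []
    mid = median(u for _, u in util)
    return [(disk, u) for disk, u in util if u > factor * mid]


# Rows from the statit example, trimmed after the xfers column.
sample_rows = [
    "disk             ut%  xfers",
    "/aggr1/plex0/rg0:",
    "0a.10.10          31  93.15",
    "0a.10.1           33  93.98",
    "0a.10.2           19 118.78",
    "0a.10.3           21 120.65",
    "0a.10.4           20 119.76",
    "0a.10.5          100 121.62",
    "0a.10.6           18 119.96",
    "0a.10.7           18 119.06",
    "0a.10.8           18 121.28",
]
print(busy_outliers(sample_rows))  # [('0a.10.5', 100)]
```

The median is used instead of the mean so that the single busy drive does not pull the baseline up.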

  • ONTAP events (EMS logs) may report:
    • Several errors and aborts on the drive before ONTAP marks it as failed. Example:

... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...
... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...
... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...
... config_thread: raid.disk.delete.drl:debug]: aggregate Disk /aggr_name/plex0/rg0/ [...] Deleting dirty region log ...
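
To see whether the retries concentrate on one drive, the scsi.cmd.retrySuccess events can be counted per disk. This is a minimal sketch that assumes the message layout shown in the example above. The function name retry_counts is hypothetical, and the log lines are expected as plain text, one event per line.

```python
import re
from collections import Counter

# Matches the event name, any severity, and captures the disk name up to the next colon.
_RETRY_RE = re.compile(r"scsi\.cmd\.retrySuccess:\w+\]: Disk device (\S+?):")


def retry_counts(lines):
    """Count scsi.cmd.retrySuccess events per disk device."""
    counts = Counter()
    for line in lines:
        match = _RETRY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Lines from the EMS example above.
sample_lines = [
    "... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...",
    "... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...",
    "... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...",
    "... config_thread: raid.disk.delete.drl:debug]: aggregate Disk /aggr_name/plex0/rg0/ [...] Deleting dirty region log ...",
]
for disk, count in retry_counts(sample_lines).most_common():
    print(disk, count)  # 3b.51.1L2 3
```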

 

    • A "long" consistency point (CP) in an aggregate. Example:

wafl_exempt08: wafl.cp.toolong:error]: Aggregate aggr_name experienced a long CP.

    • A Storage Health Monitor (SHM) I/O latency threshold event (shm.threshold.ioLatency). Example:

[Cluster-01: disk_latency_monitor: shm.threshold.ioLatency:debug]: Disk XX.XX.XX has exceeded the expected IO latency in the current window with average latency of 50 msecs and average utilization of 100 percent. Highest average IO latency: XX.XX.: 50 msecs; next highest IO latency: XX.XX.XX: 6 msecs. Disk XX.XX.XX Shelf X Drawer X Slot X Bay XX [NETAPP   X375_TTCRE04TA07 NA03] S/N [#########] 
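
When many of these events are collected, the reported average latency and utilization can be extracted for comparison across drives. The following sketch assumes the message wording shown in the example above. The function name parse_io_latency is hypothetical.

```python
import re

# Captures the two numeric values reported in the shm.threshold.ioLatency message text.
_SHM_RE = re.compile(
    r"average latency of (\d+) msecs and average utilization of (\d+) percent"
)


def parse_io_latency(message: str):
    """Return (average latency in msecs, average utilization in percent), or None if absent."""
    match = _SHM_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Excerpt of the example event above, with placeholders kept as-is.
sample_message = (
    "Disk XX.XX.XX has exceeded the expected IO latency in the current window "
    "with average latency of 50 msecs and average utilization of 100 percent."
)
print(parse_io_latency(sample_message))  # (50, 100)
```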

 

 

 

