High Latency on disks with no HDD errors
Applies to
- ONTAP 9
- FAS systems
- Cloud Volumes ONTAP systems with hard disk drives (HDD)
Issue
- Increased latency for NAS and SAN end users
- qos statistics volume latency show points to seconds of latency at the disk layer, but the disk latency histogram in statistics shows 8ms or less (see the qos sketch after this list)
cluster1::> qos statistics volume latency show -vserver vs0 -volume vs0_vol0
Workload            ID    Latency    Network    Cluster       Data       Disk    Qos Max    Qos Min      NVRAM      Cloud  FlexCache    SM Sync         VA
--------------- ------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
-total-              -   455.00us   158.00us        0ms   297.00us        0ms        0ms        0ms        0ms        0ms        0ms        0ms        0ms
vs0_vol0-wid1..  15658   109.00ms   155.00us        0ms   273.00us    108.2ms        0ms        0ms        0ms        0ms        0ms        0ms        0ms
- High io_queued is seen in the disk object for only one disk or a few disks; in most cases it is a single disk
- There are no hardware errors for that disk in the event log, nor any other hardware issue on the shelf or stack that could explain I/O queuing on a single drive (see the event log sketch after this list)
- Failing the drive may move the I/O queuing to another disk
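The Disk column should dominate the latency breakdown consistently, not just in a single sample. A minimal sketch of watching the breakdown over time, reusing the vserver and volume names from the example above (the -iterations count is arbitrary):

cluster1::> qos statistics volume latency show -vserver vs0 -volume vs0_vol0 -iterations 60

If Disk stays in the tens or hundreds of milliseconds while Network and Cluster remain in microseconds, the delay is at the disk layer rather than in the network path.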
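The no-hardware-errors condition can be cross-checked from the clustershell. A minimal sketch, assuming the affected node is node1; message names and output fields vary by ONTAP release:

Cluster::> event log show -node node1 -severity ERROR
Cluster::> storage shelf show -errors

In this scenario, neither command reports anything relevant for the suspect disk, its shelf, or its stack.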
Example: The statistics command's disk object shows high queuing on one disk, and statit shows that same disk, 0d.23.13, far busier (95% utilization) than the other disks in its RAID group
Cluster::> set -privilege diag
Warning: These diagnostic commands are for use by NetApp personnel only.
Do you want to continue? {y|n}: y

Cluster::*> statistics start -object disk -counter io_queued
Statistics collection is being started for sample-id: sample_148

Cluster::*> statistics show -filter "io_queued>100"

Object: disk
Instance: 0d.23.13
Start-time: 12/5/2022 16:48:26
End-time: 12/5/2022 16:51:58
Elapsed-time: 212s
Scope: node1
Number of Constituents: 1 (complete_aggregation)

    Counter                                                     Value
    -------------------------------- --------------------------------
    io_queued                                                     818
1 entry was displayed.

Cluster::*> node run -node node1 -command statit -b

Note: After about 30 seconds of collection, the disk with the observed queuing (0d.23.13) stands out as the hot disk in the statit output:
Cluster::*> node run -node node1 -command statit -e
...
disk             ut%   xfers  ureads--chain-usecs  writes--chain-usecs  cpreads-chain-usecs  greads--chain-usecs  gwrites-chain-usecs
/aggr1_node1/plex0/rg2:
0d.23.18          25   69.19    0.01  2.00  2033   31.50 60.67   175   37.69 51.19   169    0.00  ....     .    0.00  ....     .
0a.21.16          24   69.77    0.01  2.00  6964   32.11 59.56   194   37.66 51.18    90    0.00  ....     .    0.00  ....     .
0d.22.17          57  231.26  133.56  5.45  2075   26.91 29.41   573   70.79  9.87   641    0.00  ....     .    0.00  ....     .
0d.23.22          57  230.68  132.96  5.46  1845   26.83 29.56   646   70.90  9.74   604    0.00  ....     .    0.00  ....     .
0d.23.13          95  295.63  198.16  4.10  5472   26.83 29.76  1371   70.63  9.91  1975    0.01  ....     .    0.00  ....     .
0d.22.18          57  231.26  133.55  5.38  2080   26.84 29.60   561   70.86  9.73   634    0.00  ....     .    0.00  ....     .
0a.20.18          57  231.69  133.54  5.42  1846   27.00 29.46   647   71.15  9.78   608    0.00  ....     .    0.00  ....     .
0a.20.16          57  233.00  134.49  5.48  1879   27.08 30.09   634   71.43  9.83   594    0.00  ....     .    0.00  ....     .
0d.22.19          57  231.98  134.18  5.41  2099   26.87 29.67   567   70.93  9.87   646    0.00  ....     .    0.00  ....     .

Cluster::*> set admin
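Once the queuing data has been reviewed, the manually started sample can be stopped so it does not continue collecting in the background. A minimal sketch, assuming the sample-id reported by statistics start in the example above:

Cluster::*> statistics stop -sample-id sample_148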