Sick disk causes performance impact

Last updated

Dec 21, 2024

Category:: ontap-9

Specialty:: perf

Applies to

ONTAP 9
ONTAP 8
FAS
AFF
Does not apply to a single disk that has already failed (ONTAP will fail a drive based on a threshold of errors and latency)
Applies to disk(s) that have not failed

Issue

High FlexVol latency is observed.
High latency may lead to NFS disconnections in some scenarios.
The qos statistics volume latency show command shows that most of the delay is in the Disk column.
A single drive exhibits significantly higher utilization and latency than the other drives in the RAID group.
This can be validated with the nodeshell statit command (in the example below, disk 0a.10.5 is at 100% utilization while its peers are at 18-33%):

cluster1::> node run -node local -command "priv set -q advanced; statit -e"
...
disk             ut%  xfers  ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr1/plex0/rg0:
0a.10.10          31  93.15    0.00   ....     .  54.89  26.94   590  38.26  38.85   155   0.00   ....     .   0.00   ....     .
0a.10.1           33  93.98    0.00   ....     .  55.75  26.55   630  38.23  38.83   183   0.00   ....     .   0.00   ....     .
0a.10.2           19 118.78    9.53   3.50  8515  56.77  10.57   291  52.49   9.60   543   0.00   ....     .   0.00   ....     .
0a.10.3           21 120.65   10.11   3.80  8440  58.10  10.88   362  52.43   9.50   566   0.00   ....     .   0.00   ....     .
0a.10.4           20 119.76    9.21   3.27  9108  57.79  10.54   314  52.76   9.44   552   0.00   ....     .   0.00   ....     .
0a.10.5          100 121.62   10.52   3.22 19375  58.78  10.20  7699  52.32   9.79  4831   0.00   ....     .   0.00   ....     .
0a.10.6           18 119.96    9.57   3.33  8727  57.97  10.73   216  52.42   9.64   541   0.00   ....     .   0.00   ....     .
0a.10.7           18 119.06    9.01   3.53  8786  57.71  10.57   223  52.34   9.56   535   0.00   ....     .   0.00   ....     .
0a.10.8           18 121.28    9.75   3.76  8179  59.29  10.89   235  52.24   9.72   544   0.00   ....     .   0.00   ....     .
0a.10.9           19 121.30   10.90   3.47  8249  58.15  11.07   217  52.26   9.87   526   0.00   ....     .   0.00   ....     .
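
The comparison above can be automated. The following is a minimal Python sketch, not a NetApp tool: it parses per-disk rows like those in the statit output above and flags any disk whose ut% is far above the median of its group. The function names and both thresholds (factor and floor) are illustrative assumptions, and the sample is a subset of the rows shown above.

```python
import statistics

# Subset of the per-disk rows from the statit output above.
SAMPLE = """\
/aggr1/plex0/rg0:
0a.10.10          31  93.15    0.00   ....     .  54.89  26.94   590  38.26  38.85   155   0.00   ....     .   0.00   ....     .
0a.10.2           19 118.78    9.53   3.50  8515  56.77  10.57   291  52.49   9.60   543   0.00   ....     .   0.00   ....     .
0a.10.4           20 119.76    9.21   3.27  9108  57.79  10.54   314  52.76   9.44   552   0.00   ....     .   0.00   ....     .
0a.10.5          100 121.62   10.52   3.22 19375  58.78  10.20  7699  52.32   9.79  4831   0.00   ....     .   0.00   ....     .
0a.10.6           18 119.96    9.57   3.33  8727  57.97  10.73   216  52.42   9.64   541   0.00   ....     .   0.00   ....     .
"""


def parse_statit_disks(text):
    """Return {disk_name: ut%} for every per-disk row found in the text."""
    disks = {}
    for line in text.splitlines():
        fields = line.split()
        # A disk row has 18 whitespace-separated fields; the second is ut%.
        if len(fields) == 18 and fields[1].isdigit():
            disks[fields[0]] = int(fields[1])
    return disks


def find_hot_disks(disks, factor=2.0, floor=80):
    """Flag disks whose ut% is at least `floor` and `factor` times the median.

    Both thresholds are illustrative assumptions, not NetApp criteria.
    """
    median_ut = statistics.median(disks.values())
    return [name for name, ut in disks.items()
            if ut >= floor and ut >= factor * median_ut]


if __name__ == "__main__":
    print(find_hot_disks(parse_statit_disks(SAMPLE)))
```

With the sample rows, only 0a.10.5 is flagged. The median is used rather than the mean so that the outlier itself does not pull the baseline upward.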

EMS logs may report several errors and aborts on the disk before it is marked as failed.

scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0: 
 cdb 0x28:3b468100:0008 (24080).
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0:
 cdb 0x28:3b4681a8:0008 (24081).
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0:
 cdb 0x88:000000020ab11b00:00000008 (24928).
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0:
 cdb 0x88:00000002b7ff0d00:00000038 (24619).
config_thread: raid.disk.delete.drl:debug]: aggregate Disk /aggr01_node02/plex0/rg0/3b.51.1L1 Shelf 51 Bay 1 
 [NETAPP X481_SMKRE06TSDB NA03] S/N [S4D12BT0] UID [5000C500:8CE40C44:00000000:00000000:00000000:00000000:00000000:
 00000000:00000000:00000000] Deleting dirty region log DRL_1.
  wafl.cp.toolong:error]: Aggregate fas_01_DATA_AGGR experienced a long CP...
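
To see whether retries cluster on one drive, the scsi.cmd.retrySuccess events can be tallied per disk. The sketch below is a hedged Python example that assumes the EMS text has been saved locally and that each event follows the layout shown above; the function name and regular expression are illustrative, not part of ONTAP.

```python
import re
from collections import Counter

# Two of the events from the EMS excerpt above, including continuation lines.
SAMPLE = """\
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0:
 cdb 0x28:3b468100:0008 (24080).
scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry #1/#0:
 cdb 0x28:3b4681a8:0008 (24081).
"""

# Capture the disk name that follows "Disk device" in each retrySuccess event.
RETRY_RE = re.compile(
    r"scsi\.cmd\.retrySuccess:\w+\]: Disk device (\S+?): request successful"
)


def count_retries(ems_text):
    """Return a Counter mapping disk name to number of retrySuccess events."""
    return Counter(RETRY_RE.findall(ems_text))


if __name__ == "__main__":
    for disk, count in count_retries(SAMPLE).most_common():
        print(disk, count)
```

A disk that accounts for most of the retries is a candidate for closer inspection with statit.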

EMS may also report the message: wafl.cp.toolong:error

wafl_exempt08: wafl.cp.toolong:error]: Aggregate fas_01_DATA_AGGR experienced a long CP.

EMS may also report the message: shm.threshold.ioLatency

[Cluster-01: disk_latency_monitor: shm.threshold.ioLatency:debug]: Disk XX.XX.XX has exceeded the expected IO latency in the current window with average latency of 50 msecs and average utilization of 100 percent. Highest average IO latency: XX.XX.: 50 msecs; next highest IO latency: XX.XX.XX: 6 msecs. Disk XX.XX.XX Shelf X Drawer X Slot X Bay XX [NETAPP X375_TTCRE04TA07 NA03] S/N [#########]
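
The figures in this message can be extracted to compare the flagged disk with the next-slowest one. The following is a minimal Python sketch that assumes the message wording matches the example above; the function name, field names, and regular expression are illustrative assumptions.

```python
import re

# Message text taken from the example above (trailing disk details omitted).
MESSAGE = (
    "[Cluster-01: disk_latency_monitor: shm.threshold.ioLatency:debug]: "
    "Disk XX.XX.XX has exceeded the expected IO latency in the current window "
    "with average latency of 50 msecs and average utilization of 100 percent. "
    "Highest average IO latency: XX.XX.: 50 msecs; "
    "next highest IO latency: XX.XX.XX: 6 msecs."
)

PATTERN = re.compile(
    r"average latency of (?P<latency_ms>\d+) msecs "
    r"and average utilization of (?P<util_pct>\d+) percent\."
    r".*?next highest IO latency: \S+?: (?P<next_ms>\d+) msecs"
)


def parse_io_latency(message):
    """Return latency, utilization, and next-highest latency as ints, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


if __name__ == "__main__":
    info = parse_io_latency(MESSAGE)
    print(info)
    # Ratio of the flagged disk's latency to the next-slowest disk's latency.
    print(round(info["latency_ms"] / info["next_ms"], 1))
```

In the example message the flagged disk averages 50 msecs against 6 msecs for the next-slowest disk, which is the kind of gap that points to a single sick drive rather than an overloaded aggregate.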