Single SSD causing performance issue

Category: ontap-9

Specialty: perf

Applies to

AFF, ASA and C-Series systems
ONTAP versions without a fix for Bug ID 1479263

Issue

A single problematic SSD can cause performance issues on an aggregate because of high read/write I/O latency on that disk.
If the disk is partitioned, it can impact both HA partner controllers and more than one aggregate.
The statit output shows high utilization and latency on a single SSD (for example, disk 0c.01.5):

node> statit -e  
                       Disk Statistics (per second)  
        ut% is the percent of time the disk was busy.  
        xfers is the number of data-transfer commands issued per second.  
        xfers = ureads + writes + cpreads + greads + gwrites  
        chain is the average number of 4K blocks per command.  
        usecs is the average disk round-trip time per 4K block.  
disk             ut%  xfers  ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr1/plex0/rg0:

0a.00.9            2 275.84    0.00   ....     .  95.38  36.71    32 180.46  18.28    41   0.00   ....     .   0.00   ....     .
0a.00.1            2 276.54    0.50   1.40   120  95.88  36.57    31 180.16  18.30    40   0.00   ....     .   0.00   ....     .
0a.00.3            1 2659.57  2030.59   3.70   131 266.35   7.80    89 362.63   2.86   210   0.00   ....     .   0.00   ....     .
3d.00.4            1 2667.07  2047.99   3.79   112 261.65   8.27    56 357.43   2.93   143   0.00   ....     .   0.00   ....     .
0a.00.5            1 2733.05  2096.08   3.72   108 271.35   8.25    89 365.63   2.95   153   0.00   ....     .   0.00   ....     .
3d.00.6            1 2506.70  1916.42   3.43   124 243.45   8.19    66 346.83   2.85   146   0.00   ....     .   0.00   ....     .
0a.00.7            1 2450.61  1897.82   3.47   109 224.46   8.40    84 328.33   2.84   150   0.00   ....     .   0.00   ....     .
3d.00.8            1 2462.91  1902.72   3.58   117 228.55   8.35    69 331.63   2.89   149   0.00   ....     .   0.00   ....     .
3d.00.10           1 2500.00  1913.12   3.45   117 238.25   7.96    78 348.63   2.76   152   0.00   ....     .   0.00   ....     .
3d.00.2            1 2428.81  1839.93   3.54   117 243.75   7.98    88 345.13   2.92   149   0.00   ....     .   0.00   ....     .
3d.00.0            1 2451.11  1877.52   3.44   120 237.35   8.17    97 336.23   2.89   153   0.00   ....     .   0.00   ....     .
0c.01.5           95 2352.92  1538.77   6.53  2579 385.19  12.08  2353 428.96   3.56  2176   0.00   ....     .   0.00   ....     .
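
The comparison above (one disk with far higher ut% and usecs than its RAID group peers) can be automated against saved statit output. The following is a minimal Python sketch, not a NetApp tool: the function names, the 50% utilization threshold, and the 5x-median latency factor are assumptions chosen for illustration, and the parser assumes disk rows have the same 18-field layout as the sample.

```python
from statistics import median

def _num(token):
    """Convert a statit field to float; dotted placeholders ('....', '.') become None."""
    return None if token.strip(".") == "" else float(token)

def parse_disk_rows(statit_text):
    """Return one dict per disk row found in saved `statit -e` output."""
    rows = []
    for line in statit_text.splitlines():
        tok = line.split()
        # Assumed layout: name, ut%, xfers, then 5 x (rate, chain, usecs) = 18 fields.
        if len(tok) != 18 or not tok[1].isdigit():
            continue
        rows.append({
            "disk": tok[0],
            "ut": int(tok[1]),
            "usecs": {
                "ureads": _num(tok[5]),
                "writes": _num(tok[8]),
                "cpreads": _num(tok[11]),
            },
        })
    return rows

def find_outliers(rows, ut_threshold=50, latency_factor=5.0):
    """Flag disks that are busy or much slower than the median of their peers.

    Both thresholds are illustrative assumptions, not NetApp guidance.
    """
    medians = {}
    for kind in ("ureads", "writes", "cpreads"):
        values = [r["usecs"][kind] for r in rows if r["usecs"][kind] is not None]
        medians[kind] = median(values) if values else None

    flagged = []
    for r in rows:
        slow = any(
            v is not None
            and medians[k] is not None
            and v > latency_factor * medians[k]
            for k, v in r["usecs"].items()
        )
        if r["ut"] >= ut_threshold or slow:
            flagged.append(r["disk"])
    return flagged
```

Passing the sample output above to `parse_disk_rows` and then `find_outliers` would flag only disk 0c.01.5, the one disk whose utilization and per-block latency stand apart from the rest of the RAID group.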

The EMS logs show disk errors similar to the following:

Tue May 17 08:06:00 +0000 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.checkCondition:error]: Disk device 0c.01.5: Check Condition: CDB 0x8a:000000019b222800:00000120: Sense Data SCSI:aborted command - (0xb - 0x2f 0x14 0x0)(4509).
Tue May 17 08:06:00 +0000 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.checkCondition:error]: Disk device 0c.01.5: Check Condition: CDB 0x8a:000000019b222928:00000010: Sense Data SCSI:aborted command - (0xb - 0x2f 0x14 0x0)(4512).
Tue May 17 08:06:00 +0000 [node1: scsi_ecmdblk_strthr_admin: scsi.cmd.checkCondition:error]: Disk device 0c.01.5: Check Condition: CDB 0x8a:000000019b222940:000000c0: Sense Data SCSI:aborted command - (0xb - 0x2f 0x14 0x0)(4514).
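
To see whether these errors cluster on one disk, the scsi.cmd.checkCondition events can be counted per device from a saved EMS log. The sketch below is a hypothetical offline helper in Python. The regular expression is derived only from the message layout shown above and may need adjusting for other message variants; the function names are illustrative.

```python
import re
from collections import Counter

# Pattern inferred from the sample messages above (an assumption, not an official format).
CHECK_CONDITION = re.compile(
    r"scsi\.cmd\.checkCondition:\w+\]: Disk device (?P<disk>\S+): "
    r"Check Condition: CDB (?P<cdb>\S+): Sense Data (?P<sense>.+?) - "
    r"\((?P<sense_key>0x[0-9a-fA-F]+) - (?P<codes>[^)]*)\)"
)

def parse_check_conditions(ems_text):
    """Return one dict per check-condition event found anywhere in the text."""
    events = []
    for m in CHECK_CONDITION.finditer(ems_text):
        events.append({
            "disk": m.group("disk"),
            "cdb": m.group("cdb"),
            "sense": m.group("sense"),
            # 0xb is the SCSI sense key for ABORTED COMMAND.
            "sense_key": int(m.group("sense_key"), 16),
            "codes": m.group("codes"),  # remaining codes kept as raw text
        })
    return events

def count_by_disk(events):
    """Count check-condition events per disk device."""
    return Counter(e["disk"] for e in events)
```

Because the parser uses `finditer` over the whole text, it works whether each event is on its own line or several events are run together on one line.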

The failed commands succeed on retry a short time later:

Tue May 17 08:06:00 +0000 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0c.01.5: request successful after retry #0/#1: cdb 0x8a:000000019b222800:00000120 (5017).
Tue May 17 08:06:00 +0000 [node1: scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 0c.01.5: request successful after retry #0/#1: cdb 0x8a:000000019b222940:000000c0 (5017).
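
Each retrySuccess message carries the same CDB as the check condition it resolves, so the two event types can be paired. The following Python sketch lists check-condition commands that have no matching retrySuccess in a given log excerpt. It is illustrative only: the patterns are inferred from the sample messages, the function name is hypothetical, and a missing match may just mean the excerpt is incomplete.

```python
import re

# Patterns inferred from the sample EMS messages (assumptions, not an official format).
CHECK = re.compile(
    r"scsi\.cmd\.checkCondition:\w+\]: Disk device (\S+): "
    r"Check Condition: CDB (\S+): Sense Data"
)
RETRY = re.compile(
    r"scsi\.cmd\.retrySuccess:\w+\]: Disk device (\S+): "
    r"request successful after retry \S+: cdb (\S+) \("
)

def commands_without_retry_success(ems_text):
    """Return (disk, cdb) pairs that logged a check condition but no retry success."""
    recovered = {(disk, cdb.lower()) for disk, cdb in RETRY.findall(ems_text)}
    pending = []
    for disk, cdb in CHECK.findall(ems_text):
        key = (disk, cdb.lower())
        if key not in recovered and key not in pending:
            pending.append(key)
    return pending
```

Run against the check-condition and retrySuccess excerpts in this article, the function would report only the command with CDB 0x8a:000000019b222928:00000010, which has no retrySuccess line in the excerpt shown.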

If the disk is partitioned, the latency can affect more than one aggregate, as the long consistency point (CP) events below show:

Tue May 17 08:06:00 +0000 [node1: wafl_exempt00: wafl.cp.toolong:error]: Aggregate aggr1 experienced a long CP.
Tue May 17 08:06:45 +0000 [node1: wafl_exempt00: wafl.cp.toolong:error]: Aggregate aggr2 experienced a long CP.
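
To gauge how many aggregates are affected, the wafl.cp.toolong events can be collected from a saved EMS log. This is a small Python sketch under the same assumptions as before: the pattern is inferred from the two sample messages, and `aggregates_with_long_cp` is a hypothetical helper name.

```python
import re

# Pattern inferred from the sample wafl.cp.toolong messages (an assumption).
LONG_CP = re.compile(r"wafl\.cp\.toolong:\w+\]: Aggregate (\S+) experienced a long CP")

def aggregates_with_long_cp(ems_text):
    """Return the aggregates that logged a long CP, in order of first appearance."""
    seen = []
    for aggr in LONG_CP.findall(ems_text):
        if aggr not in seen:
            seen.append(aggr)
    return seen
```

If the result contains more than one aggregate and the slow disk is partitioned, that is consistent with the single-SSD symptom described in this article, although long CPs can have other causes.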