NetApp Knowledgebase

When should a drive be failed (rebuild) versus letting it fail gracefully (sick disk copy)?

Category:
disk-drives
Specialty:
hw
Applies to

  • ONTAP Drives

Answer

  1. I see errors in the system log (Messages/EMS). Do I have a bad disk? Does this mean I need to fail it out manually?


When you see drives with errors, this does not necessarily mean there is an issue:

  • Hard drive technology has evolved considerably since drives were first introduced.
  • Software and firmware have evolved to handle errors much more intelligently and gracefully.
  • A hard drive might experience a one-time error but still be in service many years later.
  • A drive might be reporting several errors in a row, but still be responsive/usable.
  • If errors are detected, ONTAP applies thresholds to decide whether a drive is bad enough to be failed out.
  • If a disk has produced enough errors, but has not impacted users, ONTAP gracefully fails the disk out and places it into testing (Maintenance Center).
    • This testing gives the drive a thorough checkout; if it passes, the disk is returned to service.
    • A drive returned to service has been tested from beginning to end and returned no errors.
    • Any errors that did exist were corrected internally, and the drive is ready for service.
    • If a disk enters the Maintenance Center a third time, it is failed out permanently, an AutoSupport message is sent to NetApp for automatic disk replacement, and a replacement drive is usually shipped automatically.
  • If you are seeing errors reporting timeouts or bad sectors, but no latency, application/user complaints, or timeouts, the software is working as designed.
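The graceful-failure behavior described above can be sketched as a small state machine. Note that the actual error thresholds are internal to ONTAP; the error count, class name, and state names below are assumptions for illustration only. The "third Maintenance Center entry fails the drive" rule is from this article.

```python
# Illustrative sketch of ONTAP's graceful drive error handling.
# ERROR_THRESHOLD is hypothetical; MAX_MC_ENTRIES reflects this KB's
# statement that a third Maintenance Center entry fails the drive.

class DriveErrorTracker:
    ERROR_THRESHOLD = 5   # hypothetical: errors before Maintenance Center entry
    MAX_MC_ENTRIES = 3    # per the KB: third entry fails the drive permanently

    def __init__(self):
        self.error_count = 0
        self.mc_entries = 0
        self.state = "in_service"

    def record_error(self):
        """Count a recoverable error; enter Maintenance Center at threshold."""
        self.error_count += 1
        if self.error_count >= self.ERROR_THRESHOLD:
            self._enter_maintenance_center()

    def _enter_maintenance_center(self):
        self.mc_entries += 1
        if self.mc_entries >= self.MAX_MC_ENTRIES:
            # Third entry: fail the drive; AutoSupport triggers replacement.
            self.state = "failed"
        else:
            # Thorough test passed: errors corrected, drive returns to service.
            self.error_count = 0
            self.state = "in_service"

drive = DriveErrorTracker()
for _ in range(15):       # three bursts of five errors each
    drive.record_error()
print(drive.state)        # → failed
```

The key point the sketch captures is that isolated errors do not fail a drive; only repeated trips through testing do.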
  2. I see errors, but am unsure whether I have an issue. Things seem to be running worse since the errors started, or I notice a timeout that coincides with application-reported latency, but normally everything is fine. What should I do?
  • This might be a bad drive. See the following sections to understand what to look for in statit.
  • statit is a node-level, advanced-privilege command used to examine disk I/O latency and utilization on a per-drive basis.
  3. I see errors and/or have timeouts, and the problem is constant or easily traceable to a failing disk. Is there a way to confirm?

The statit command will help you determine if there is an issue. Run the following:

Cluster::> node run node01
node01> priv set advanced
node01*> statit -b

[wait 30 seconds]

node01*> statit -e

This will produce a very detailed output. The following is a sample with the disk in question highlighted:

disk             ut%  xfers  ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr3/plex0/rg1:
3a.61             64 118.75   35.96   5.81  4376  45.17   4.72  5971  37.62   8.06  2178   0.00   ....     .   0.00   ....     .
4a.76            100 118.26   34.88   5.67 16441  46.64   5.61 10991  36.74   7.88  6229   0.00   ....     .   0.00   ....     .
3a.45             68 126.40   43.41   5.35  4810  47.52   4.51  6050  35.47   7.60  2167   0.00   ....     .   0.00   ....     .

Notice that 4a.76 has the following versus surrounding disks 3a.61 and 3a.45:

  1. Higher ut% or disk active percentage

  2. Higher latency in each usec column for uread, write, and cpread by 4-12 ms

  • The disk was in fact causing an application/user issue as well and was worth failing out.
  • This command helps measure whether a disk might need to be failed out.
  • Collect several iterations of statit to confirm that a disk is a problem.
  • If each iteration consistently shows higher latency/utilization for the same disk, and there are no other hardware issues, it is a bad drive.
  • Check the system log to see whether this disk also appears there.
  • Another scenario is a drive that times out only at specific times, with the application reporting issues only then.
  • A disk failure might still be required.
  • statit might not be useful in this case, because the problem occurs only at certain times and capturing statit output during the event might be impossible.
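The comparison above (one disk standing out from its RAID-group peers) can be automated. The sketch below parses per-disk lines in the format of the sample output and flags any disk whose uread latency far exceeds the group median. The column positions and the 2x-median threshold are assumptions based on the sample, not an official heuristic; verify against your own statit report.

```python
import statistics

# Per-disk lines copied from the sample statit output above.
SAMPLE = """\
3a.61             64 118.75   35.96   5.81  4376  45.17   4.72  5971  37.62   8.06  2178
4a.76            100 118.26   34.88   5.67 16441  46.64   5.61 10991  36.74   7.88  6229
3a.45             68 126.40   43.41   5.35  4810  47.52   4.51  6050  35.47   7.60  2167
"""

def parse_disks(text):
    """Extract disk name, ut%, and uread/write latency (usecs) per row."""
    disks = []
    for line in text.splitlines():
        f = line.split()
        disks.append({"disk": f[0], "ut": int(f[1]),
                      "uread_usecs": int(f[5]), "write_usecs": int(f[8])})
    return disks

def suspects(disks, factor=2.0):
    """Flag disks whose uread latency exceeds factor x the group median."""
    median = statistics.median(d["uread_usecs"] for d in disks)
    return [d["disk"] for d in disks if d["uread_usecs"] > factor * median]

print(suspects(parse_disks(SAMPLE)))   # → ['4a.76']
```

Running this over several statit iterations, as the article recommends, shows whether the same disk is flagged consistently.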
  4. I have determined that I have an issue. Is it better to fail the disk gracefully or forcefully remove it and let RAID rebuild?

If the load on the aggregate is very high, either a copy or an immediate failure will be painful until the process completes.

  • If one disk in a RAID group is bad, each iteration of statit consistently shows higher latency/utilization for it, and the cause is not fragmentation or workload, a rebuild (immediate failure) is generally the better option. The difference is the -i flag. Run the following:

cluster1::> storage disk fail -disk 1.1.16 -i true
WARNING: The system will not prefail the disk and its contents will not be
copied to a replacement disk before being failed out. Do you want to
fail out the disk immediately? {y|n}: y

In Data ONTAP 7-Mode, the equivalent command is:

disk fail [-i] [-f] <disk_name>
Filer> disk fail -i -f 1a.01.16

  • If a disk is showing intermittent pain, as in the second example above (the issue occurs only about once an hour and is not constant), a graceful failure is the better option: the disk behaves normally most of the time, and the issue might last only a minute each hour.
  • If multiple disks are having issues in a raid-group, this situation depends on how many disks are in the raid-group, how bad each disk is, and the load on the system.
  • The time to recover/rebuild will vary. Ensure that a backup/DR plan exists, as this case is riskier and can cause data loss or downtime. There are no general recommendations here; engage NetApp Support and/or your account team for the best guidance.
  • If unsure, contact NetApp Technical Support.
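The decision points above can be condensed into a simple helper. This is a hypothetical restatement of this article's guidance, not an official NetApp tool, and the returned strings merely summarize the recommendations in the bullets above.

```python
# Hypothetical decision helper restating this KB's guidance on whether
# to fail a disk immediately, fail it gracefully, or escalate.

def failure_recommendation(consistent_high_latency: bool,
                           intermittent_only: bool,
                           multiple_disks_affected: bool) -> str:
    if multiple_disks_affected:
        # Depends on RAID group size, disk condition, and load: escalate.
        return "engage NetApp Support; no general recommendation"
    if consistent_high_latency:
        # Every statit iteration shows the same outlier disk.
        return "immediate fail and rebuild (storage disk fail -i true)"
    if intermittent_only:
        # Disk is healthy most of the time: prefail and copy contents.
        return "graceful fail with sick disk copy (storage disk fail)"
    return "monitor; collect more statit iterations"

print(failure_recommendation(True, False, False))
```

As with any summary, treat the output as a starting point and confirm with the full criteria above before failing a disk.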

Additional Information

For more detailed information on running the statit command and using it to monitor disk I/O statistics, see the KB article How to assess disk-level response times?

Recommended Best Practices:
  • Try to keep disk load below 50-70% of maximum IOPS so that, if a disk fails, the rebuild does not cause user latency. If it is necessary to push above this threshold, have a fallback plan in place.

  • Make sure all current recommended settings are applied, the most recent recommended ONTAP version for your hardware is installed, and drive, shelf module, and ACP (if SAS) firmware are at the latest versions.

  • Follow all recommended best practices for Multi-Path HA and cabling. 

  • Consult your NetApp account team if you feel the load or versions of Data ONTAP might need a review to make sure they are at appropriate levels. 

  • If engaging NetApp Support, it is recommended to collect a Perfstat or, at the very minimum, run statit.
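The 50-70% headroom recommendation above can be illustrated with simple arithmetic. All figures below (the per-disk IOPS ceiling and the rebuild overhead) are hypothetical; real limits vary by drive type and workload.

```python
# Illustrative arithmetic for the 50-70% headroom guideline.
max_iops_per_disk = 200                   # hypothetical per-drive ceiling
normal_load = 0.60 * max_iops_per_disk    # 120 IOPS, within the guideline

# During a rebuild, surviving disks in the RAID group also serve
# reconstruction reads, so effective per-disk load rises.
rebuild_overhead = 0.30                   # hypothetical 30% extra load
load_during_rebuild = normal_load * (1 + rebuild_overhead)

# With headroom, the elevated load still fits under the ceiling.
print(load_during_rebuild <= max_iops_per_disk)   # → True
```

Run the same numbers at 90% normal load and the rebuild pushes the disks past their ceiling, which is exactly when users see latency.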