NetApp Knowledge Base

Understanding Media and Recovered Disk Errors


Applies to

  • ONTAP
  • Disk media errors

Answer

What is a media error?

A media error is an event where a storage disk was unable to perform the requested I/O operation because of problems accessing the stored data.

Media errors are more common on read transactions but might occur on writes as well. A media error on a write may occur when the disk has problems locating the position to write the data. On reads, in addition to these positioning faults, the disk may experience problems retrieving the data. When a disk writes data, it also writes supporting information, such as position markers and a CRC or checksum used to confirm the integrity of the written data.

When this information cannot be correctly accessed on the first attempt, the disk may classify this as a media error.

Media errors are subclassified into recoverable and unrecoverable. Both types of errors are expected. In fact, disk vendors publish expected rates for recoverable and unrecoverable media errors. These error rates are expressed as errors per number of bits accessed on the media.

Note: This is media access, not user data. The 512 bytes of user data are only a portion of what is written for each sector. Bytes for headers, trailers, CRCs, and gap records must all be counted for these error rates, as must data retrieved for read-aheads that might never be used.
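As a rough illustration of how a per-bit error rate translates into an expected error count, the sketch below multiplies the bits accessed on the media by a rate. The rate of 1 error per 10^14 bits, the 10 TB workload, and the 15% overhead fraction are placeholder assumptions chosen for the arithmetic; they are not vendor figures from this article.

```python
def expected_media_errors(user_bytes_read, errors_per_bit, overhead_fraction=0.0):
    """Expected error count for a given amount of user data read.

    overhead_fraction models the extra bits (headers, CRCs, gap records,
    read-ahead) accessed on the media beyond the user data itself.
    """
    media_bits = user_bytes_read * 8 * (1.0 + overhead_fraction)
    return media_bits * errors_per_bit


# Hypothetical inputs, for illustration only.
ASSUMED_RATE = 1e-14          # assumed: 1 error per 1e14 bits accessed
TEN_TB = 10 * 10**12          # 10 TB of user data

no_overhead = expected_media_errors(TEN_TB, ASSUMED_RATE)
with_overhead = expected_media_errors(TEN_TB, ASSUMED_RATE, overhead_fraction=0.15)
print(no_overhead, with_overhead)
```

Counting the overhead bits raises the expected count for the same amount of user data, which is why the rate is stated against media access.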

What are recoverable errors?

Recoverable errors are more common and so have a higher expected error rate. A recovered (or recoverable) error indicates that the disk had to do more than the initial access to read the data correctly, but was ultimately able to provide the requested data with no intervention by RAID. A retry may have been enough, or, if retries did not produce the correct data, the disk may have applied error correction.

Note that disks always record error correction codes with data. These correction codes are used to reliably reproduce missing data bits. As a disk reads data, it computes a CRC for the data being read; it then compares this CRC to that stored with the data. If they do not match, then error correction codes may be applied to reproduce the missing bits.
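The compare-on-read check described above can be sketched as follows. This is only an illustration: drives implement their own CRC and ECC schemes in firmware, and `zlib.crc32` stands in for them here. The function names are hypothetical.

```python
import zlib


def store_sector(data: bytes):
    """Return the data together with the CRC recorded at write time."""
    return data, zlib.crc32(data)


def read_is_clean(data_read: bytes, stored_crc: int) -> bool:
    """Recompute the CRC on read and compare it with the stored value."""
    return zlib.crc32(data_read) == stored_crc


payload = bytes(range(256)) * 2           # 512 bytes of sample user data
data, crc = store_sector(payload)

corrupted = bytearray(data)
corrupted[100] ^= 0x01                    # flip one bit to simulate a bad read

print(read_is_clean(data, crc))           # clean read: CRCs match
print(read_is_clean(bytes(corrupted), crc))  # mismatch: correction would be needed
```

A mismatch only detects the problem; reproducing the missing bits is the job of the separate error correction codes, which this sketch does not model.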

Data recovery is generally a series of escalating steps: retrying, repositioning and retrying, and applying error correction codes. If one of these steps is successful, then the disk considers this a recovered data operation. If all measures of retrieving or reproducing data fail, then this becomes an unrecoverable data operation.
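The escalation can be pictured with a minimal sketch. The step names and the `attempt_read` callable are hypothetical; actual drive firmware is vendor specific and may use more or different steps.

```python
DEFAULT_STEPS = ("initial", "retry", "reposition_and_retry", "apply_ecc")


def read_with_recovery(attempt_read, steps=DEFAULT_STEPS):
    """Try each recovery step in order.

    attempt_read(step) is a hypothetical callable that returns the data on
    success or None on failure. Returns (status, data, step_used).
    """
    for index, step in enumerate(steps):
        data = attempt_read(step)
        if data is not None:
            # Success on the first access is a normal read; success on any
            # later step counts as a recovered error.
            status = "ok" if index == 0 else "recovered"
            return status, data, step
    # Every step failed: this is an unrecoverable error.
    return "unrecovered", None, None


def make_fake_disk(succeeds_at):
    """Build a fake read function that only succeeds at the named step."""
    def attempt(step):
        return b"data" if step == succeeds_at else None
    return attempt


print(read_with_recovery(make_fake_disk("apply_ecc")))
print(read_with_recovery(make_fake_disk(None)))
```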

How does a system recover from media errors during normal operations?

Because media faults are expected, they can be handled correctly when disks are part of a RAID storage system. The filer handles these events in the following manner:

First, the filer's software looks at the occurrence of recoverable and unrecoverable block errors with respect to the data transfer rate. If the error rates exceed a certain threshold, then a module called the Storage Health Monitor (SHM) generates an AutoSupport message. The SHM checks a variety of parameters besides these error rates, looking for things like excessive time to completion of I/Os or excessive timeouts.

If an error is a recoverable error, then nothing is done with the transaction. It was, after all, recoverable. The disk driver considers this a successful I/O operation. However, the SHM does note the event and considers it when calculating the error-to-data rate.
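The idea of weighing errors against the amount of data transferred can be sketched as below. The article does not describe the SHM's actual algorithm or thresholds, so the counter names, the combined error count, and the threshold value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DiskCounters:
    """Hypothetical per-disk counters; not an ONTAP data structure."""
    recovered_errors: int = 0
    unrecovered_errors: int = 0
    bytes_transferred: int = 0

    def error_rate(self) -> float:
        """Errors per byte transferred (0.0 if no data has moved yet)."""
        if self.bytes_transferred == 0:
            return 0.0
        total = self.recovered_errors + self.unrecovered_errors
        return total / self.bytes_transferred


HYPOTHETICAL_THRESHOLD = 1e-9    # made-up value, in errors per byte


def should_flag(counters, threshold=HYPOTHETICAL_THRESHOLD):
    """Flag the disk only when its error-to-data rate exceeds the threshold."""
    return counters.error_rate() > threshold


quiet = DiskCounters(recovered_errors=2, bytes_transferred=10**12)
noisy = DiskCounters(recovered_errors=5000, unrecovered_errors=20,
                     bytes_transferred=10**12)
print(should_flag(quiet), should_flag(noisy))
```

The point of the ratio is that a raw error count means little on its own: a busy disk is expected to log more errors than an idle one.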

If an error is unrecoverable, then RAID action is taken. The disk returns the logical block address (LBA) of the error as part of the error information. The controller takes that LBA and issues a command to the disk to reassign that bad block address. The disk performs internal functions so that, from that point on, access to that LBA causes access to a different portion of the physical media. Note that disks have a large pool of sectors reserved for just such a transaction.

If reassigning the LBA to a different physical location fails for any reason, the disk reports an error and the controller responds by failing the disk. A disk that cannot successfully reassign a bad block is no longer used.

If the reassignment is successful, then the new physical location associated with the LBA is rewritten.
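The reassign-then-rewrite sequence for an unrecoverable error can be summarized in a short sketch. `FakeDisk`, `handle_unrecoverable_error`, and `rebuild_from_parity` are hypothetical names used for illustration; they are not ONTAP interfaces.

```python
class FakeDisk:
    """Hypothetical stand-in for a disk."""

    def __init__(self, reassign_works=True):
        self.reassign_works = reassign_works
        self.blocks = {}
        self.failed = False

    def reassign(self, lba):
        """Pretend to remap the LBA to a spare sector; report success or failure."""
        return self.reassign_works

    def write(self, lba, data):
        self.blocks[lba] = data


def handle_unrecoverable_error(disk, lba, rebuild_from_parity):
    """Reassign the bad LBA, then rewrite it; fail the disk if reassignment fails."""
    if not disk.reassign(lba):
        disk.failed = True
        return "disk_failed"
    # The LBA now points at a new physical location, so rewrite the data there.
    disk.write(lba, rebuild_from_parity(lba))
    return "rewritten"


good = FakeDisk()
bad = FakeDisk(reassign_works=False)
good_result = handle_unrecoverable_error(good, 42, lambda lba: b"rebuilt")
bad_result = handle_unrecoverable_error(bad, 42, lambda lba: b"rebuilt")
print(good_result, bad_result)
```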

What happens when there is a media error in degraded mode?

In degraded mode, reconstruction might not be possible. When one disk in a RAID group has already failed, a block-level failure on another disk in the same group is a double error condition for that sector. RAID-4 can handle a single failure, but losing a sector on another disk means that for that stripe there are two failed disks. As a result, RAID-4 cannot determine the missing data. It is the functional equivalent of one equation with two unknowns: it cannot be solved.

A multi-disk panic is the end result. This is why failing a disk for media faults is not a good idea. While a disk may have a media fault on a sector, it is extremely unlikely that another disk will have a media fault at the same block address. But if you fail a disk, a media fault on *any* sector of the remaining disks effectively becomes a double disk failure.
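The "one equation with two unknowns" point can be made concrete with single-parity XOR, the scheme RAID-4 is built on. The two-byte blocks below are toy values, not real stripe data.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data_disks = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data_disks)

# One missing block: XOR of the survivors and parity reproduces it.
rebuilt = xor_blocks([data_disks[0], data_disks[2], parity])

# Two missing blocks: the survivors only yield the XOR of the two unknowns,
# and many different pairs of blocks would produce that same value.
xor_of_unknowns = xor_blocks([data_disks[2], parity])

print(rebuilt == data_disks[1])
print(xor_of_unknowns.hex())
```

With one block missing, parity gives a unique answer. With two missing, the pair actually stored and, for example, an all-zero block paired with `xor_of_unknowns` are equally consistent with what remains, so the stripe cannot be reconstructed.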

Should I be concerned with media and recovered disk errors?

Media and recovered errors are reported as part of normal drive operations. They do not indicate a drive failure.
A number of media or recovered errors during disk scrub is common, because the disk scrub process reads the entire disk.
Storage Health Monitor will flag the drive if the media errors exceed expected error rates. This condition generates a Predicted Failure AutoSupport message, a case is created, and you are notified of the problem.

No, these errors are part of normal drive operations and ONTAP (through Storage Health Monitor) will warn you if a drive is about to fail and should be removed.

I have been using NetApp drives for a long time and these new drives have a lot more media errors. Is something wrong with the new drives?

If ONTAP has not flagged the disk drives for removal, the drives are operating normally. NetApp uses drives of greater density to provide more storage in the same space. The denser the drive, the more likely media errors will occur. This is normal and will continue with future drive technologies. NetApp closely monitors current drive technologies and will act if high failure rates for a product family are identified.

But I heard there is a general rule to remove drives with 'lots of media errors'?

ONTAP will notify you if you need to be concerned about the health of a drive.

Interpreting Media and Recovered Disk Error Messages

A full message line in the event log will appear similar to the following:

[Image: Disk media errors]

Sample of a media error - recovered

[Image: Disk media errors]

line 1: The adapter driver reports a recovered error (sense data 1 17, 3) during a read operation (op 0x28) on drive 8a.22, sector 180529. This means the drive had to do some additional work to read the data in this sector. The sector remains valid and is not reassigned. The three-part code (1 17, 3) is the sense data reported by the drive. The system translates this code into the human-readable meaning "recovered error".
line 2: The SCSI layer also reports the same error. The SCSI layer reports the sense data in hex and adds a fourth code called a FRU. The FRU is used by the drive vendor.
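To show how the sense data and operation code map to readable text, here is a small decoder sketch. It assumes the three values are the SCSI sense key, additional sense code (ASC), and qualifier (ASCQ), and that they are hexadecimal. The lookup tables hold only a few standard SCSI values, and `describe` is a hypothetical helper, not an ONTAP function.

```python
# Small subsets of the standard SCSI tables; real tables are much larger.
SENSE_KEYS = {
    0x1: "RECOVERED ERROR",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
}
OPCODES = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
}


def describe(sense_key, asc, ascq, opcode):
    """Render sense data and opcode as a readable one-line summary."""
    key_name = SENSE_KEYS.get(sense_key, "unknown sense key")
    op_name = OPCODES.get(opcode, "unknown opcode")
    return (f"{op_name}: {key_name} "
            f"(key=0x{sense_key:X}, asc=0x{asc:02X}, ascq=0x{ascq:02X})")


# Values taken from the recovered-error sample: sense data 1 17, 3 and op 0x28.
summary = describe(0x1, 0x17, 0x03, 0x28)
print(summary)
```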

Notes:

  1. RAID is not involved in recovery of this error. The drive recovered from the error internally and has served the operation as requested.
    This is often referred to as a 'recovered error' or a 'recovered media error'.
  2. The sector involved in a recovered error is not reassigned.
    The sector may be accessed again with no errors or it may report recovered errors.
  3. No further action is required. The drive will continue to run normally and the data is safe.

Sample of a media error - unrecovered

[Image: Disk media errors]

line 1: The adapter driver reports an unrecovered read error (sense data 3 11, 0) during a read operation (op 0x28) on drive 7a.38, sector 37565872. This means the drive was not able to provide the data requested from this sector. The three-part code (3 11, 0) is the sense data reported by the drive. The system translates this code into the human-readable meaning "unrecovered read error".
line 2: Reports the sector in question will be reassigned.
line 3: The adapter driver reports the serial number of the drive with the unrecovered read error.
line 4: The SCSI layer also reports the same error. The SCSI layer reports the sense data in hex and adds a fourth code called a FRU. The FRU is used by the drive vendor.

Note: The disk was not able to recover from this error; thus it reports an unrecovered read error. This is where RAID takes over; see the next section.

Process for recovering unrecovered media errors

[Image: Disk media errors]

line 1: The adapter driver reports an unrecovered read error (sense data 3 11, 0) during a read operation (op 0x28) on drive 7a.38, sector 37565872. This means the drive was not able to provide the data requested from this sector. The three-part code (3 11, 0) is the sense data reported by the drive. The system translates this code into the human-readable meaning "unrecovered read error".
line 2: Reports the sector in question will be reassigned.
line 3: The adapter driver reports the serial number of the drive with the unrecovered read error.
line 4: The SCSI layer also reports the same error. The SCSI layer reports the sense data in hex and adds a fourth code called a FRU. The FRU is used by the drive vendor.
line 5: The RAID layer reports a read error on this disk in block 4695734. This is the block that was stored on sector 37565872.
line 6: The RAID layer reports that the data in the bad block was rewritten from parity.
line 7: The adapter driver reports the sector was successfully reassigned. The bad sector, 37565872, will not be used again.

Note: Once the bad block is rewritten from parity, no further action is required. The drive will continue to run normally and the data is safe. Do NOT fail a drive for this error.
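The relationship between the sector in line 1 and the block in line 5 can be checked with simple arithmetic. The sketch assumes a 4 KB block made of eight 512-byte sectors with no starting offset; that ratio is inferred from the two numbers in the sample and is not stated in the article.

```python
SECTOR_BYTES = 512
BLOCK_BYTES = 4096                                 # assumption: 4 KB block
SECTORS_PER_BLOCK = BLOCK_BYTES // SECTOR_BYTES    # 8


def sector_to_block(sector):
    """Map a 512-byte sector number to the 4 KB block that contains it."""
    return sector // SECTORS_PER_BLOCK


block = sector_to_block(37565872)
print(block)   # 4695734, the block reported by the RAID layer in the sample
```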


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.