NetApp E-Series Products
The Unreadable Sector Management (USM) feature provides a controller-based method for handling unreadable sectors that are detected during normal I/O processing and during long-running operations such as reconstructions. The feature is designed to be mostly transparent to the end user; therefore, no configurable options are available and the functionality cannot be disabled. The five main functional improvements provided by USM are:
- Better reporting of the unreadable sector (and USM related conditions) in the Major Event Log (MEL).
- Persistent reporting of unreadable sectors.
- Continuation of reconstructions and other long-running operations despite media errors.
- Successful completion of writes to optimal RAID 5 volumes even when parity cannot be generated.
- Persist media error conditions through RAID reconfiguration operations (DVE, DCE).
Because NetApp supports both FC-SCSI and SATA disks in a single subsystem, the feature is designed in a way that does not rely on physical disk functionality and is handled completely in the controller firmware. However, there is no attempt made by the controller firmware to emulate capabilities for disks that do not natively support them.
For the purposes of this article, an 'unreadable sector' is defined as a volume logical block address (LBA) that is considered completely unreadable due to a physical disk media related double fault condition OR a physical disk media related single fault condition on non-redundant volumes (RAID 0). An unreadable sector is unrecoverable and the data contained in that location should be considered lost and must be recovered by some other means.
How does the unreadable sector database work and what is logged to it?
The unreadable sector database is maintained in stable storage and contains an entry for each unreadable sector detected. It is recorded using volume centric information, which includes:
- Unique Volume Identifier (not the SSID)
- Volume LBA
- Block Count
The benefit of keeping the database in stable storage is that the reporting of unreadable sectors is persistent: it survives reboots, volume reconfigurations, firmware upgrades (provided the upgrade is to a code version that supports USM), volume state changes, and volume transfers.
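As a rough illustration of the volume-centric record described above, the following Python sketch models a single database entry. The class name, field names, and sample values are hypothetical; the article does not document the controller's actual storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnreadableSectorEntry:
    """Hypothetical model of one unreadable sector database entry."""
    volume_id: str    # unique volume identifier (not the SSID)
    volume_lba: int   # first unreadable volume LBA
    block_count: int  # number of consecutive unreadable blocks

    def last_lba(self) -> int:
        # Last volume LBA covered by this entry (inclusive).
        return self.volume_lba + self.block_count - 1


entry = UnreadableSectorEntry(volume_id="example-volume-id", volume_lba=2048, block_count=8)
print(entry.last_lba())  # 2055
```

Because the record is keyed by volume rather than by physical disk location, an entry of this shape can stay valid when the underlying drives change.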
To view the USM database:
- In SANtricity (Array Management Window), go to Monitor >> Reports >> "Unreadable Sectors Log".
- SANtricity CLI Command:
show storageArray unreadableSectors
- In SANtricity Array Manager (SAM), go to Support >> Support Center >> Diagnostics tab and select "View/Clear Unreadable Sectors".
How and why are unreadable sectors entered in the database?
Entries can be made to the unreadable sector database on any read operation, whether it is a host I/O or an internal operation that requires a read from the physical media. For redundant configurations, this means that both data and parity (or mirror) locations can generate an entry in the unreadable sector database. During a host read I/O, if a physical disk returns a media error, the controller will first attempt to reconstruct the data (in an optimal, redundant configuration). If the reconstruction of that data fails, an entry will be made in the unreadable sector database and the host I/O will be failed with a sense key of Hardware Error (0x04).
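The host read path described above can be sketched as a small decision function. This is a minimal illustration only: the function name, parameters, and the use of a Python set as the database are assumptions, and the sense key value is the one quoted in the paragraph above.

```python
SENSE_KEY_HARDWARE_ERROR = 0x04  # value quoted in the article for this path


def handle_host_read(media_error, optimal_redundant, reconstruct_ok, database, location):
    """Return (status, sense_key) for one host read. All names are illustrative."""
    if not media_error:
        return ("good", None)
    # Reconstruction is only attempted in an optimal, redundant configuration.
    if optimal_redundant and reconstruct_ok:
        return ("good", None)
    # The data could not be recovered: log the sector and fail the host I/O.
    database.add(location)
    return ("failed", SENSE_KEY_HARDWARE_ERROR)


db = set()
result = handle_host_read(True, True, False, db, ("vol-A", 2048))
print(result, db)
```

On a non-redundant volume (RAID 0) the sketch skips reconstruction, which matches the single fault definition of an unreadable sector given earlier.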
Media Scan with Redundancy Check:
During the scan, all data and parity information is read and compared. When a media error is encountered, whether it is a new media error or one that has already been logged in the database, the following action takes place. An attempt will be made to reconstruct the data at the location, and if successful a write-back to disk will take place. If an entry existed for the write location in the unreadable sector database, then the entry will be removed. If the data cannot be reconstructed, then entries will be made to the database for all unreadable sectors involved.
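A minimal sketch of the media scan rule, assuming the database can be modeled as a Python set of (volume, LBA) tuples. The function name and return strings are hypothetical and only mirror the two outcomes described above.

```python
def media_scan_on_media_error(database, locations, reconstruct_ok):
    """Apply the media scan rule to the sectors involved in one media error."""
    if reconstruct_ok:
        # Reconstructed data is written back to disk; any existing entry is removed.
        for loc in locations:
            database.discard(loc)
        return "repaired"
    # Reconstruction failed: every unreadable sector involved is logged.
    database.update(locations)
    return "logged"


db = {("vol-A", 100)}
first = media_scan_on_media_error(db, [("vol-A", 100)], reconstruct_ok=True)
second = media_scan_on_media_error(db, [("vol-A", 200), ("vol-A", 201)], reconstruct_ok=False)
print(first, second, sorted(db))
```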
RAID 5 Reconstruction: Reads will be processed during a RAID 5 reconstruction. If an existing entry is found in the unreadable sector database, the corresponding sector on the reconstructing drive will be added to the unreadable sector database and a critical MEL event will be generated for the data lost on the reconstructing drive. The sector on the reconstructing drive will be written with a known data pattern.
Two scenarios exist if a new media error is returned from disk while a RAID 5 reconstruction is in progress:
A read error occurs on a data segment
Action taken: Two unreadable sectors will be added to the unreadable sector database, one for the defective sector on the data drive and the second for the data that could not be regenerated on the reconstructing drive. A critical MEL event will be generated for both sectors, as user data is lost in both places. As a consequence of this operation, an unreadable sector entry will be added to the in-memory table for the parity drive.
A read error occurs on a parity segment
Action taken: One entry will be made to the unreadable sector database for the location that could not be regenerated on the reconstructing drive. A single MEL event will be logged for the lost user data. An additional entry will be made to the in-memory table for the location on the parity drive.
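The two scenarios can be summarized as a lookup of the bookkeeping each one triggers. This is a sketch under stated assumptions: the function name, dictionary keys, and sector labels are hypothetical and simply restate the actions listed above.

```python
def raid5_rebuild_media_error(segment_type):
    """Summarize bookkeeping for a new media error during a RAID 5 reconstruction.

    segment_type is "data" or "parity": the kind of segment whose read failed.
    """
    if segment_type == "data":
        return {
            "database_entries": ["data drive sector", "reconstructing drive sector"],
            "mel_events": 2,  # user data is lost in both places
            "in_memory_entries": ["parity drive sector"],
        }
    if segment_type == "parity":
        return {
            "database_entries": ["reconstructing drive sector"],
            "mel_events": 1,  # only the reconstructing drive loses user data
            "in_memory_entries": ["parity drive sector"],
        }
    raise ValueError("segment_type must be 'data' or 'parity'")


data_case = raid5_rebuild_media_error("data")
parity_case = raid5_rebuild_media_error("parity")
print(len(data_case["database_entries"]), len(parity_case["database_entries"]))
```

In both cases the parity drive location is tracked only in the in-memory table, not in the persistent database.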
RAID 1 Reconstruction: Reads will be processed during RAID 1 reconstructions. If an existing entry is found in the unreadable sector database, the corresponding sector on the reconstructing drive will be written with a known data pattern and no MEL event will be generated.
If a new media error is encountered from the source disk while a reconstruction is in progress, then the failed sector will be added to the database and a critical MEL event will be generated. The reconstructing drive will be written with a default pattern.
Copy-Back Operations: During a copy-back operation, the hot spare drive will be read and copied to the replacement drive. If an existing unreadable sector is found in the database, no new entries are made and the logical to physical mapping is updated. If a new unreadable sector is detected, an attempt will be made to reconstruct the data. If the data cannot be recovered, then a new unreadable sector entry will be added for the target sector, and a critical MEL event will be generated. The target sector will be written with the known data pattern and the copy-back operation will continue to completion.
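The per-sector decisions during copy-back can be sketched as follows. The function name and action strings are hypothetical, and the action taken after a successful reconstruction (copying the reconstructed data) is an assumption, since the article only describes the failure case in detail.

```python
def copy_back_sector(already_logged, media_error, reconstruct_ok):
    """Return the list of actions for one sector during a copy-back operation."""
    if already_logged:
        # Known unreadable sector: no new entry, only the mapping changes.
        return ["update logical-to-physical mapping"]
    if not media_error:
        return ["copy sector"]
    if reconstruct_ok:
        return ["copy reconstructed sector"]  # assumption, not stated in the article
    # Data is unrecoverable; the copy-back still continues to completion.
    return [
        "add database entry for target sector",
        "generate critical MEL event",
        "write known data pattern to target sector",
    ]


print(copy_back_sector(already_logged=False, media_error=True, reconstruct_ok=False))
```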
Immediate Availability Format (IAF): When IAF detects a media error, it places an entry in the unreadable sector database for the unreadable sector and corresponding parity block. A critical MEL event is generated for the site of the media error.
Dynamic Reconfiguration: When performing a Volume Reconfiguration, the number of drives in the volume group, the RAID level, or the stripe size could change. Unreadable sectors detected on the source during this operation will be logged only if an existing entry was not already in the database. These unreadable blocks will be migrated to new locations in the target configuration, the physical location in the unreadable sector database for the logical block is updated, and a MEL event is generated.
Implementing USM does not guarantee that physical media errors will not impact the user’s access to their data. Unreadable Sector Management is designed to lessen this possibility by providing the end-user with notifications that these errors exist. USM, when used in conjunction with a feature such as media scan, can provide a mindful system administrator with the opportunity to take proactive measures and prevent hardware problems from impacting data access.
Host reads that intersect with sectors already logged in the unreadable sector database will return a sense key of 0x03 (Medium Error) and an ASC/ASCQ of 0x11/0x00 (Unrecoverable read error).
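To make "intersect" concrete, the sketch below checks whether a host read range overlaps any logged entry and, if so, returns the sense data quoted above. The function name, constant names, and the (volume_lba, block_count) tuple layout are assumptions for illustration.

```python
SENSE_KEY_MEDIUM_ERROR = 0x03
ASC_ASCQ_UNRECOVERED_READ = (0x11, 0x00)


def sense_for_read(read_lba, read_blocks, entries):
    """Return (sense_key, asc, ascq) if the read overlaps a logged entry, else None.

    entries is an iterable of (volume_lba, block_count) tuples for the target volume.
    """
    read_end = read_lba + read_blocks  # exclusive end of the host read
    for lba, count in entries:
        # Two half-open ranges overlap when each starts before the other ends.
        if read_lba < lba + count and lba < read_end:
            return (SENSE_KEY_MEDIUM_ERROR,) + ASC_ASCQ_UNRECOVERED_READ
    return None


logged = [(2048, 8)]                      # covers LBAs 2048 through 2055
hit = sense_for_read(2050, 16, logged)    # overlaps LBAs 2050 through 2055
miss = sense_for_read(2056, 16, logged)   # starts just past the entry
print(hit, miss)
```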
The unreadable sector database holds a maximum of 1000 entries. This limit applies across all volume groups, volumes, and disks. Once the database is full, controller behavior will be as follows:
- For new unreadable sectors encountered during a reconstruction, the reconstructing drive is failed but no entry is made to the unreadable sector database.
- For new unreadable sectors encountered during host I/O, the host I/O is failed and no entry is made.
- For all new unreadable sectors detected after the database is full, a critical MEL event will be generated. Each subsequent attempt to access this location will generate a critical event as no entry can be made for it in the database.
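The full-database behavior listed above can be sketched as follows. The function name, action strings, and the set-based database are hypothetical. The sketch tracks only the behaviors this passage describes and omits the events logged in the normal (not full) case.

```python
MAX_ENTRIES = 1000  # limit stated in the article


def record_new_unreadable_sector(database, location, context):
    """Return the actions for a new unreadable sector.

    context is "reconstruction" or "host_io".
    """
    actions = []
    if len(database) < MAX_ENTRIES:
        database.add(location)
        actions.append("entry added")
    else:
        # Database full: no entry can be made, so a critical event is raised instead.
        actions.append("critical MEL event (no entry made)")
        if context == "reconstruction":
            actions.append("reconstructing drive failed")
    if context == "host_io":
        actions.append("host I/O failed")
    return actions


full_db = set(range(MAX_ENTRIES))
print(record_new_unreadable_sector(full_db, "new-sector", "reconstruction"))
```

Because nothing is recorded once the limit is reached, the same location produces a fresh critical event on every later access.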
Clearing the Unreadable Sector database:
Entries can be removed from the unreadable sector database by one of the following methods:
- User Request: A user may request to clear database entries for a specified volume, volume group, or entire subsystem via the SANtricity GUI or a SANtricity CLI script. This type of request clears all unreadable sectors at the specified level and causes the following to occur:
  - A known data pattern to be written to the corresponding sectors.
  - Correct parity to be generated for the stripe containing the unreadable sectors.
  - Entries to be removed from the database.
From SANtricity, it is possible to view the unreadable sector entries and select the clear option by going to Monitor >> Reports >> "Unreadable Sectors Log", or by using the following SANtricity CLI script to clear the entries:
clear allVolumes unreadableSectors;
From the controller SAM, go to Support >> Support Center >> Diagnostics tab, select "View/Clear Unreadable Sectors", select an entry, and then clear it.
- Successful Write: A successful write to a sector entered in the USM database will also remove that entry. When a write occurs that intersects a known unreadable sector, that write will be converted to a WRITE AND VERIFY to ensure that the sector has been repaired and is readable. If the WRITE AND VERIFY returns a good status, then the sector is removed from the database.
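Both removal paths can be sketched in a few lines, assuming the database is modeled as a set of (volume_id, lba) tuples. The function names are hypothetical, and the sketch omits the pattern write and parity regeneration that the controller performs on a user-requested clear.

```python
def clear_entries(database, volume_id=None):
    """Remove entries for one volume, or all entries when volume_id is None."""
    cleared = {e for e in database if volume_id is None or e[0] == volume_id}
    # A real clear would also write a known pattern and regenerate parity here.
    database.difference_update(cleared)
    return cleared


def handle_write(database, location, verify_ok):
    """Return the command used for a host write and update the database."""
    if location not in database:
        return "WRITE"
    # The write intersects a known unreadable sector: convert it to WRITE AND VERIFY.
    if verify_ok:
        database.discard(location)  # sector is repaired and readable again
    return "WRITE AND VERIFY"


db = {("vol-A", 10), ("vol-A", 11), ("vol-B", 99)}
command = handle_write(db, ("vol-B", 99), verify_ok=True)
removed = clear_entries(db, volume_id="vol-A")
print(command, sorted(removed), db)
```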
Restrictions Imposed by USM: If unreadable sectors exist in the database for a volume group or volume, certain features are disabled.
- Remote Volume Mirroring (RVM): Controller firmware will reject the creation of mirror relationships whenever unreadable sector entries exist for the primary volume. If an unreadable sector is encountered during the synchronization, both the synchronization and the mirror relationship are failed.
- Snapshots: Controller firmware will reject the creation of snapshots whenever entries exist in the USM database for a volume. This applies to both the source and the associated repository volume.
- Volume Copy: Controller firmware will reject Volume Copy requests when the source volume contains unreadable sector entries in the USM database.
- Reconfiguration Operations: Controller firmware will reject Volume Reconfiguration requests made for volumes that have unreadable sector entries in the USM database.
- Volume Import: If a volume group is imported that would cause the unreadable sector database to overflow, the import will be failed and the new volume will be kept in an offline state. A MEL event will be generated and a recovery guru action is logged, explaining that the number of entries in the unreadable sector database must be reduced before the import can be performed.
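The restrictions above amount to two kinds of checks, sketched here with hypothetical function names. The first rejects a request when any volume it depends on has entries in the database; the second rejects an import that would push the database past its 1000 entry limit.

```python
MAX_ENTRIES = 1000  # database limit stated in the article


def request_allowed(volumes_checked, volumes_with_entries):
    """Allow a request only if none of the checked volumes has database entries."""
    return not any(v in volumes_with_entries for v in volumes_checked)


def import_allowed(current_entries, imported_entries, limit=MAX_ENTRIES):
    """Allow a volume group import only if the database would not overflow."""
    return current_entries + imported_entries <= limit


affected = {"vol-A"}  # volumes that currently have unreadable sector entries
# A snapshot request checks both the source and its repository volume.
snapshot_ok = request_allowed(["vol-B", "vol-A"], affected)
# A volume copy request checks only the source volume.
copy_ok = request_allowed(["vol-B"], affected)
print(snapshot_ok, copy_ok, import_allowed(990, 20))
```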