What is the difference between disk predictive failure and physical disk failure in ONTAP 9?
Applies to
- ONTAP 9
Answer
This article explains how to distinguish a hard drive predictive failure from a physical disk failure, how the alerts for each differ, and how ONTAP handles each situation.
Predictive Failure vs. Physical Disk Failure
- Predictive Failure:
Definition: Predictive failure is an early warning that a hard drive is likely to fail soon. It is detected through monitoring tools that analyze various metrics like SMART data, error rates, and performance degradation.
Alert Displays: Predictive failures often generate warning messages or alerts indicating that the drive health is deteriorating. These alerts can include messages such as "shm.threshold.consecutiveTimeouts:error" or "shm.threshold.highIOLatency:error".
Example:
[node1: disk_latency_monitor: shm.threshold.highIOLatency:error]: Disk 0b.00.1 exceeds the average IO latency threshold and will be recommended for failure.
[node2: disk_server_1: shm.threshold.consecutiveTimeouts:error]: shm: Disk 0a.10.9 has exceeded the threshold of 11 consecutive timeouts; the system will fail the disk if possible.
Handling by ONTAP 9: When a predictive failure is detected, ONTAP 9 will typically:
- Generate an alert to notify administrators.
- Recommend replacing the drive before it fails completely.
- In some cases, start proactive data migration to a spare disk to prevent data loss.
- Physical Disk Failure:
Definition: A physical disk failure occurs when a hard drive stops functioning due to hardware issues such as mechanical failure, electronic failure, or other physical damage.
Alert Displays: Physical disk failures generate critical error messages or alerts indicating that the drive is no longer operational. These alerts can include messages such as "raid.xxx.media.err" or "raid.disk.offline".
Handling by ONTAP 9: When a physical disk failure is detected, ONTAP 9 will typically:
- Generate a critical alert to notify administrators.
- Mark the disk as failed and take it offline.
- Use RAID (Redundant Array of Independent Disks) to reconstruct the data on a spare disk if RAID is configured.
- Recommend immediate replacement of the failed disk to maintain data integrity and redundancy.
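When reviewing saved event logs, the distinction above can be applied mechanically by looking at the event name in each message. The following is a minimal Python sketch, not a NetApp tool: the function names, the regular expression for the bracketed message layout, and the mapping of event names to categories are assumptions based only on the examples quoted in this article, and the "xxx" in "raid.xxx.media.err" is treated as a wildcard segment.

```python
import re

# Event-name patterns taken from the examples in this article. The mapping
# to categories is an illustrative assumption, not an official ONTAP list.
PREDICTIVE_PREFIXES = ("shm.threshold.",)
PHYSICAL_PATTERNS = (
    re.compile(r"raid\.\w+\.media\.err"),  # "xxx" treated as a wildcard segment
    re.compile(r"raid\.disk\.offline"),
)

# Assumed layout: "[node: process: event.name:severity]: message"
EMS_LINE = re.compile(
    r"\[(?P<node>[^:\]]+):\s*(?P<process>[^:\]]+):\s*"
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]:\s*(?P<message>.*)"
)


def classify_event(event):
    """Return 'predictive', 'physical', or 'unknown' for an event name."""
    if event.startswith(PREDICTIVE_PREFIXES):
        return "predictive"
    if any(pattern.match(event) for pattern in PHYSICAL_PATTERNS):
        return "physical"
    return "unknown"


def parse_line(line):
    """Parse one log line into a dict with a 'category' key, or None if it does not match."""
    match = EMS_LINE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["category"] = classify_event(info["event"])
    return info


if __name__ == "__main__":
    sample = (
        "[node1: disk_latency_monitor: shm.threshold.highIOLatency:error]: "
        "Disk 0b.00.1 exceeds the average IO latency threshold and will be "
        "recommended for failure."
    )
    result = parse_line(sample)
    print(result["node"], result["event"], result["category"])
```

Event names not covered by the two pattern lists are reported as "unknown" rather than guessed, so the lists can be extended as other disk-related events are encountered.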
