What is the difference between a predictive disk failure and a physical disk failure in ONTAP 9?

Applies to

ONTAP 9

Answer

This article explains how to distinguish a predictive disk failure from a physical disk failure, how the alerts for each differ, and how ONTAP 9 handles each situation.

Predictive Failure vs. Physical Disk Failure

Predictive Failure:

Definition: Predictive failure is an early warning that a hard drive is likely to fail soon. It is detected through monitoring tools that analyze various metrics like SMART data, error rates, and performance degradation.

Alert Displays: Predictive failures often generate warning messages or alerts indicating that the drive health is deteriorating. These alerts can include messages such as "shm.threshold.consecutiveTimeouts:error" or "shm.threshold.highIOLatency:error".

Example:

[node1: disk_latency_monitor: shm.threshold.highIOLatency:error]: Disk 0b.00.1 exceeds the average IO latency threshold and will be recommended for failure.

[node2: disk_server_1: shm.threshold.consecutiveTimeouts:error]: shm: Disk 0a.10.9 has exceeded the threshold of 11 consecutive timeouts; the system will fail the disk if possible.
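
The two sample messages above share a bracketed header followed by free text. As a rough illustration of how such lines could be split into fields for triage, here is a minimal Python sketch. The field names (node, source, event, severity, disk) and the function name parse_ems_line are assumptions made for this example, inferred only from the two samples shown; they are not an official ONTAP log grammar.

```python
import re

# Assumed layout, inferred from the two sample messages only:
# [<node>: <source>: <event name>:<severity>]: <free text>
EMS_LINE = re.compile(
    r"^\[(?P<node>[^:\]]+):\s*(?P<source>[^:\]]+):\s*"
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]:\s*(?P<message>.*)$"
)

# The disk name is taken to be the token that follows the word "Disk".
DISK_NAME = re.compile(r"\bDisk\s+(?P<disk>[\w.]+)")


def parse_ems_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = EMS_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    disk = DISK_NAME.search(fields["message"])
    # Strip a trailing period in case the disk name ends a sentence.
    fields["disk"] = disk.group("disk").rstrip(".") if disk else None
    return fields


if __name__ == "__main__":
    sample = (
        "[node1: disk_latency_monitor: shm.threshold.highIOLatency:error]: "
        "Disk 0b.00.1 exceeds the average IO latency threshold and will be "
        "recommended for failure."
    )
    parsed = parse_ems_line(sample)
    print(parsed["node"], parsed["event"], parsed["disk"])
```

For the first sample message, this yields node1 as the node, shm.threshold.highIOLatency as the event name, and 0b.00.1 as the disk.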

Handling by ONTAP 9: When a predictive failure is detected, ONTAP 9 will typically:

Generate an alert to notify administrators.
Recommend replacing the drive before it fails completely.
In some cases, start proactive data migration to a spare disk to prevent data loss.

Physical Disk Failure:

Definition: A physical disk failure occurs when a hard drive stops functioning due to hardware issues such as mechanical failure, electronic failure, or other physical damage.

Alert Displays: Physical disk failures generate critical error messages or alerts indicating that the drive is no longer operational. These alerts can include messages such as "raid.xxx.media.err" or "raid.disk.offline."

Handling by ONTAP 9: When a physical disk failure is detected, ONTAP 9 will typically:

Generate a critical alert to notify administrators.
Mark the disk as failed and take it offline.
Use RAID (Redundant Array of Independent Disks) to reconstruct the data on a spare disk if RAID is configured.
Recommend immediate replacement of the failed disk to maintain data integrity and redundancy.
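
The distinction described in this article can be sketched as a small lookup on the event name. The Python example below is illustrative only: it covers just the event names quoted in this article, treats the "xxx" in "raid.xxx.media.err" as a wildcard (an assumption), and uses a hypothetical function name, classify_event. It is not a complete list of ONTAP disk events.

```python
from fnmatch import fnmatchcase

# Patterns taken from the event names quoted in this article.
# "raid.*.media.err" assumes that "xxx" stands for a variable segment.
PREDICTIVE_PATTERNS = ("shm.threshold.*",)
PHYSICAL_PATTERNS = ("raid.*.media.err", "raid.disk.offline")

# Short summaries of the handling described in the article.
HANDLING = {
    "predictive": "alert raised; plan replacement; data may be migrated to a spare",
    "physical": "disk failed and taken offline; RAID reconstructs to a spare; replace now",
    "unknown": "not covered by this sketch; check the event description",
}


def classify_event(event_name):
    """Return 'predictive', 'physical', or 'unknown' for an event name."""
    # Drop a trailing severity such as ':error' if one is present.
    name = event_name.strip().split(":", 1)[0]
    if any(fnmatchcase(name, pattern) for pattern in PREDICTIVE_PATTERNS):
        return "predictive"
    if any(fnmatchcase(name, pattern) for pattern in PHYSICAL_PATTERNS):
        return "physical"
    return "unknown"


if __name__ == "__main__":
    for event in ("shm.threshold.highIOLatency:error", "raid.disk.offline"):
        kind = classify_event(event)
        print(f"{event}: {kind} ({HANDLING[kind]})")
```

Matching on patterns rather than exact names keeps the sketch short, but any event that does not match is reported as unknown rather than guessed at.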