Loss of connectivity after disk failure
Applies to
- Hardware disk failure
- Long Consistency Point (CP) errors reported
- Data outage
Issue
- A data outage lasting a few seconds is noticed on the customer side. Examples:
  - NFS exports disconnected
  - CIFS shares not accessible
  - Missing VMs
- Hardware disk failure reported. Example:
[node_name: config_thread: raid.config.filesystem.disk.not.responding:notice]: File system Disk /aggr_name/plex0/rg0/0a.0.1 Shelf 0 Bay 1 [...] is not responding.
[node_name: monitor: monitor.globalStatus.nonCritical:error]: Disk on adapter FPF1939S03T:9, shelf 1, bay 5, not responding.
- ONTAP reports a long CP error event for the data and/or root aggregate. Example:
[node_name: wafl_exempt13: wafl.cp.toolong:error]: Aggregate aggr0 experienced a long CP.
[node_name: wafl_exempt16: wafl.cp.toolong:error]: Aggregate aggr_name experienced a long CP.
- An excessively long Consistency Point (CP) phase 2, the phase that flushes data to disks, is reported in the sktraces section of AutoSupport. Example:
2024-1-1T00:01:01Z 12345678912345678 [5:0] CRUISE_6: CP toolong: aggr0[5678901] CP_P2_FLUSH 498765ms
2024-1-1T01:01:05Z 23456789123456789 [2:0] CRUISE_6: CP toolong: aggr_name[5789012] CP_P2_FLUSH 512345ms
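When many sktrace lines need to be reviewed, the flush durations can be extracted with a short script. The following is a minimal sketch, not a supported tool: the regular expression is inferred only from the two sample lines above, and the names `CP_TOOLONG_RE`, `parse_cp_toolong`, and `SAMPLE` are hypothetical. The number in square brackets after the aggregate name is matched but not interpreted, because its meaning is not stated here.

```python
import re

# Assumption: layout inferred from the two sample "CP toolong" lines in this
# article. Real sktrace output may differ, so adjust the pattern as needed.
CP_TOOLONG_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+\d+\s+\[\d+:\d+\]\s+\S+:\s+CP toolong:\s+"
    r"(?P<aggregate>[^\[\s]+)\[\d+\]\s+"
    r"(?P<phase>\S+)\s+(?P<duration_ms>\d+)ms\s*$"
)


def parse_cp_toolong(lines):
    """Return one dict per line that matches the sample 'CP toolong' layout."""
    entries = []
    for line in lines:
        match = CP_TOOLONG_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        entries.append({
            "timestamp": match["timestamp"],
            "aggregate": match["aggregate"],
            "phase": match["phase"],
            # convert milliseconds to seconds for readability
            "duration_s": int(match["duration_ms"]) / 1000.0,
        })
    return entries


# The two sample lines from this article, used as test input.
SAMPLE = [
    "2024-1-1T00:01:01Z 12345678912345678 [5:0] CRUISE_6: CP toolong: aggr0[5678901] CP_P2_FLUSH 498765ms",
    "2024-1-1T01:01:05Z 23456789123456789 [2:0] CRUISE_6: CP toolong: aggr_name[5789012] CP_P2_FLUSH 512345ms",
]

if __name__ == "__main__":
    for entry in parse_cp_toolong(SAMPLE):
        print(f"{entry['timestamp']} {entry['aggregate']} "
              f"{entry['phase']} {entry['duration_s']:.1f}s")
```

Run against the sample lines, the script lists one row per aggregate with the `CP_P2_FLUSH` duration converted from milliseconds to seconds, which makes it easier to match long flushes to the time of the disk failure events.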