WAFL inconsistency due to lack of spares and an aggregate left degraded for a prolonged time
Applies to
- AFF/FAS Systems
- ONTAP 9
Issue
In EMS logs:
- The system has used all suitable spare disks and reports that spares are low
Sat Mar 12 05:17:26 +0200 [node_2: config_thread: raid.rg.spares.low:error]: /aggr2_2/plex0/rg0
Sat Mar 12 05:17:26 +0200 [node_2: config_thread: callhome.spares.low:error]: Call home for SPARES_LOW
- After the next disk failure, the RAID group enters a degraded state
Mon Apr 04 02:00:00 +0200 [node_2: statd: monitor.raiddp.vol.singleDegraded:error]: data disk in RAID group "/aggr2_2/plex0/rg0" is broken.
- Disk failures continue
Thu May 05 21:03:07 +0200 [node_2: config_thread: raid.rg.recons.cantStart:error]: The reconstruction cannot start in RAID group /aggr2_2/plex0/rg0: No matching disks available in spare pool, targeting any spare pool
Wed May 04 03:00:00 +0200 [node_2: statd: monitor.brokenDisk.notice:notice]: When two disks are broken in raid_dp volume, the system shuts down automatically every 24 hours to encourage you to replace the disk. If you reboot the system, it will run for another 24 hours before shutting down.
Wed May 04 03:00:00 +0200 [node_2: statd: monitor.shutdown.brokenDisk.pending:notice]: two data disks in RAID group "/aggr2_2/plex0/rg0" are broken. Halting system in 24 hours.
- Spare disks are provided and reconstruction starts
- If other disks in the RAID group are also failing, reconstruction cannot rebuild the data completely and starts marking blocks as missing
Fri May 06 10:05:51 +0200 [node_2: raidio_thread: raid_multierr_bad_block_1:error]: params: {'disk_rpm': '10000', 'vendor': 'NETAPP ', 'firmware_revision': 'NA02', 'shelf': '2', 'disk_info': 'Disk /aggr2_2/plex0/rg0/0a.02.23P1 Shelf 2 Bay 23 [NETAPP X343_SSKBE1T8A10 NA02] S/N [WBN1AJT5NP001] UID [6000C500:BCA9B53B:500A0981:00000001:00000000:00000000:00000000:00000000:00000000:00000000]', 'volumeBno': '1348939177', 'site': 'Local', 'bay': '23', 'carrier': '', 'serialno': 'WBN1AJT5NP001', 'owner': '', 'model': 'X343_SSKBE1T8A10', 'disk_type': '4', 'blockNum': '81428969'}
Fri May 06 10:05:51 +0200 [node_2: raidio_thread: raid_multierr_bad_missingBlk_1:debug]: params: {'owner': '', 'rg': '/aggr2_2/plex0/rg0', 'blockNum': '81428969', 'vbn': '7381173545'}
- When a client accesses a damaged data block, a WAFL inconsistency alert is triggered
Sun May 15 18:14:30 +0200 [node_2: wafl_exempt01: wafl.raid.incons.userdata:error]: WAFL inconsistent: inconsistent user data block at VBN 3581364492 (vvbn:567776529 fbn:664341713 level:0) in public inode (fileid:96 snapid:0 file_type:15 disk_flags:0x8402 error:120 raid_set:1) in volume node_02_vol@vserver:6456a9ee-6e12-11e8-99f3-01b099c9ade9.
Sun May 15 18:14:30 +0200 [node_2: wafl_exempt01: wafl.incons.userdata.vol:alert]: WAFL inconsistent: volume vol_02_vol@vserver:6456a9ee-6e12-11e8-99f3-01b099c9ade9 has an inconsistent user data block. Note: Any new Snapshot copies might contain this inconsistency.
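The sequence of events above can be picked out of an exported EMS log with a short script. The following is only a minimal sketch, not a NetApp tool: it assumes the log lines have the same layout as the excerpts in this article, and the names `EMS_LINE`, `STAGES` and `classify` are hypothetical helpers introduced for illustration. The stage descriptions are taken from the bullets above.

```python
import re

# Assumed line layout, based on the excerpts in this article:
# "<day> <mon> <dd> <hh:mm:ss> <tz> [<node>: <process>: <event>:<severity>]: <message>"
EMS_LINE = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4}) "
    r"\[(?P<node>[^:\]]+): (?P<process>[^:\]]+): "
    r"(?P<event>[^:\]]+):(?P<severity>\w+)\]: "
    r"(?P<message>.*)$"
)

# Hypothetical mapping from the EMS event names quoted in this article
# to the stage of the failure sequence they indicate.
STAGES = {
    "raid.rg.spares.low": "spares low",
    "monitor.raiddp.vol.singleDegraded": "RAID group degraded",
    "raid.rg.recons.cantStart": "reconstruction cannot start",
    "monitor.shutdown.brokenDisk.pending": "two disks broken, halt pending",
    "raid_multierr_bad_block_1": "bad block found during reconstruction",
    "wafl.raid.incons.userdata": "WAFL inconsistent user data block",
}


def classify(lines):
    """Return (timestamp, event, stage) for every line matching a known event."""
    timeline = []
    for line in lines:
        match = EMS_LINE.match(line.strip())
        if match is None:
            continue  # not an EMS line in the assumed layout
        stage = STAGES.get(match["event"])
        if stage is not None:
            timeline.append((match["timestamp"], match["event"], stage))
    return timeline


if __name__ == "__main__":
    sample = [
        "Sat Mar 12 05:17:26 +0200 [node_2: config_thread: "
        "raid.rg.spares.low:error]: /aggr2_2/plex0/rg0",
        "Thu May 05 21:03:07 +0200 [node_2: config_thread: "
        "raid.rg.recons.cantStart:error]: The reconstruction cannot start "
        "in RAID group /aggr2_2/plex0/rg0: No matching disks available "
        "in spare pool, targeting any spare pool",
    ]
    for timestamp, event, stage in classify(sample):
        print(f"{timestamp}  {event}  ->  {stage}")
```

The timestamps are kept as strings because the excerpts carry no year; sorting across a year boundary would need that information from elsewhere.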