Overview of Wafliron for Data ONTAP 7
Applies to
- Data ONTAP 7
- Data ONTAP 8 7-mode
Answer
The following aspects of wafliron are addressed in this article:
What is wafliron?
Wafliron is a Data ONTAP (R) tool that will check a WAFL (R) file system for inconsistencies and correct any inconsistencies found. It should be run under the direction of NetApp Technical Support. Wafliron can be run on a traditional volume or an aggregate. When run against an aggregate, it will check the aggregate and all associated FlexVol (R) volumes. It cannot be run on an individual FlexVol volume within an aggregate.
Wafliron can be run with the storage system online, provided that the root volume does not need to be checked. When run, wafliron performs the following actions on the traditional volume or in the aggregate and associated FlexVol volumes:
- Checks file and directory metadata
- Scans inodes
- Fixes file system inconsistencies
What impact can wafliron have on data availability?
Starting a wafliron will cause the aggregate and associated FlexVol volumes (or the traditional volume) to be unmounted and then remounted once several baseline checks are completed. During this first phase, clients will be unable to access the affected volumes and the console may be unresponsive. Once the wafliron is started, it cannot be aborted until the mounting phase is complete. Remounting the aggregate and FlexVol volumes in this first phase can take a substantial amount of time.
WARNING: If the aggregate is not wafl_inconsistent, prepare for downtime when performing the wafliron.
The time the storage system will spend in the first phase is difficult to estimate due to the number of contributing factors on the aggregate and FlexVol volumes. These factors include:
- The number of Snapshot(TM) copies
- The number of files
- The size of the aggregate/volumes
- The RAID group size
- The physical data layout
- RAID reconstructions occurring
- The number of LUNs in the root of the volume
It is not unusual for a 1 TB aggregate to take three or more hours to mount. Specific times vary, but for large aggregates/volumes, NetApp recommends planning a downtime window.
Once the aggregate and associated FlexVol volumes are mounted, data will be served while the wafliron continues to check the data. For an aggregate, all FlexVol volumes must be mounted before data can be served from any of the FlexVol volumes in the aggregate. If the aggregate contains a FlexVol volume with LUNs, then all LUNs within that FlexVol volume must complete their Phase 1 checks before any LUN in that FlexVol volume can be brought online.
Note: Prior to Data ONTAP 7.3, all volumes within an aggregate needed to complete Phase 1 before any volume was accessible. This behavior changed in Data ONTAP 7.3. For more information on prioritizing volumes, see section 'Can you prioritize which volumes wafliron checks first?'
The vol status command can be used to monitor whether the volumes have been remounted. If the volumes are still in the mounting phase, vol status will show:
storage1> vol status
Volume State Status Options
vol0 online raid4, trad root
vol status: Volume 'tst' is temporarily busy (vol remount).
vol status: Volume 'vol1' is temporarily busy (vol remount).
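When the output of vol status is captured to a log, the volumes still in the mounting phase can be picked out automatically. The following is a minimal Python sketch, assuming the busy message is worded exactly as in the sample above; the function name volumes_still_remounting is hypothetical and not part of Data ONTAP.

```python
import re
from typing import List

# Pattern copied from the sample console output above. Assumption: other
# Data ONTAP releases may word this message differently.
BUSY_RE = re.compile(r"Volume '([^']+)' is temporarily busy \(vol remount\)")


def volumes_still_remounting(vol_status_output: str) -> List[str]:
    """Return the names of volumes that vol status reports as remounting."""
    return BUSY_RE.findall(vol_status_output)


sample = """\
Volume State Status Options
vol0 online raid4, trad root
vol status: Volume 'tst' is temporarily busy (vol remount).
vol status: Volume 'vol1' is temporarily busy (vol remount).
"""

print(volumes_still_remounting(sample))  # ['tst', 'vol1']
```

An empty list means no volume reported the busy message in the captured text; it does not by itself prove that Phase 1 has finished.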
What is the difference between wafliron and WAFL_check?
WAFL_check and wafliron are both diagnostic tools used to check WAFL file systems.
Prior to Data ONTAP 7.3.1, wafliron makes changes as it runs and records these changes in the storage system's messages file. The administrator has no choice over which changes wafliron will commit.
In Data ONTAP 7.3.1, optional commit has been added to wafliron. This feature allows wafliron to check the affected aggregate but not commit changes until the storage administrator approves them. For more information, see section 'How is wafliron run with Optional Commit?'
If WAFL_check is run, the administrator can choose whether or not to commit changes.
Wafliron can run while the storage system is online and serving data from volumes/aggregates not being checked. If it is started from the advanced privilege mode, wafliron will allow access to the data in the aggregate once it completes its baseline checks. If it is started from the Special Boot Menu, the storage system will automatically boot and start serving data once the baseline checks are complete.
WAFL_check, however, must be run from the Special Boot Menu and the storage system will not be serving data until the WAFL_check completes and the administrator chooses to commit changes.
WARNING: NetApp Technical Support should always be consulted before running either wafliron or WAFL_check.
What are the phases of wafliron?
Wafliron checks aggregates and volumes in three phases.
Note: Wafliron is a diagnostic tool, and its usage and output are subject to change.
Phase 1
- Verifies file system access by checking the necessary metadata. This includes checks of the aggregate metadata associated with each FlexVol volume contained in that aggregate, metadata tracking free space, and Snapshot copy sanity.
- Phase 1 will check the aggregate first and then each FlexVol volume on that aggregate. After all FlexVol volumes within the aggregate are checked, the aggregate and FlexVol volumes will be mounted.
- The only status provided during this phase is a message to console logging the start of wafliron. The progress cannot be monitored during this phase.
WARNING: LUNs will not be available until Phase 1 completes. LUNs may not be automatically set to an online state. For more information, see section 'Why are LUNs still offline after wafliron phase 1 completes?'
WARNING: Snapshot copies are read-only and therefore cannot be modified by wafliron. If a Snapshot copy contains an inconsistency, the Snapshot copy will need to be deleted in order to remove the inconsistency from the file system. Always contact NetApp Support before deleting a Snapshot copy that is suspected to contain an inconsistency.
Phase 2
- Verifies the metadata for user data. If a user requests data that has not yet been checked, wafliron will check and repair it (if necessary) on-demand. Due to this on-demand checking, users may see increased latency during this phase.
- In Data ONTAP 7.2.3 and later, aggr wafliron status -s will provide progress for the wafliron.
Phase 3
- Performs clean-up tasks such as finding lost blocks/files and verifying used blocks.
- In Data ONTAP 7.2.3 and later, aggr wafliron status -s will provide progress for the wafliron.
What should be done prior to running wafliron?
Before running wafliron, the cause of the file system inconsistency should be corrected. If the inconsistency was caused by FC-AL loop instability or errors, loop testing should be performed to isolate the problem. NetApp FC-AL diagnostics can be used for troubleshooting.
Note: wafliron should not be run at the same time or immediately following any aggregate transition events. It is recommended to wait 8 minutes after a storage failover (takeover or giveback) or aggregate relocation before beginning wafliron on those aggregates.
In order to run wafliron, the following conditions must apply for the aggregate or traditional volume:
- RAID must be in an online or restricted/degraded state.
- The WAFL file system must be mounted.
- The file system may be wafl_inconsistent.
The above conditions can be checked by running the aggr status -r or vol status -r commands.
Example 1: Online aggregate that is mounted
Wafliron can be run on this aggregate.
storage1> aggr status -r aggr0
Aggregate aggr0 (online, raid_dp) (block checksums)
Example 2: Restricted aggregate that is mounted but wafl_inconsistent
Wafliron can be run on this aggregate.
storage1> aggr status -r aggr1
Aggregate aggr1 (restricted, raid_dp, reconstruct, degraded, wafl inconsistent) (block checksums)
Example 3: Restricted aggregate that is un-mounted
Wafliron cannot be run on this aggregate.
storage1> aggr status -r aggr2
Aggregate aggr2 (unmounting, raid4, reconstruct, wafl inconsistent) (block checksums)
Example 4: Failed aggregate
Wafliron cannot be run on this aggregate.
storage1> aggr status -r aggr2
Aggregate aggr2 (failed, raid_dp, partial) (block checksums)
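The four examples above can be turned into a rough pre-check on the first line of aggr status -r output. This is a sketch only: the rule it applies (the state list must contain online or restricted and must not contain failed or unmounting) is an assumption inferred from these four examples, and the function names are hypothetical. It does not replace a review by NetApp Technical Support.

```python
import re

# Matches the header line shown in the examples, e.g.
# "Aggregate aggr0 (online, raid_dp) (block checksums)".
HEADER_RE = re.compile(r"^Aggregate\s+(\S+)\s+\(([^)]*)\)")


def parse_aggr_header(line):
    """Return (aggregate_name, set_of_states) from an aggr status -r header."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not an aggr status -r header line: %r" % line)
    states = {state.strip() for state in match.group(2).split(",")}
    return match.group(1), states


def wafliron_can_run(line):
    """Heuristic from the four examples only (assumption, not a NetApp rule)."""
    _, states = parse_aggr_header(line)
    raid_ok = bool(states & {"online", "restricted"})
    blocked = bool(states & {"failed", "unmounting"})
    return raid_ok and not blocked


examples = [
    "Aggregate aggr0 (online, raid_dp) (block checksums)",
    "Aggregate aggr1 (restricted, raid_dp, reconstruct, degraded, wafl inconsistent) (block checksums)",
    "Aggregate aggr2 (unmounting, raid4, reconstruct, wafl inconsistent) (block checksums)",
    "Aggregate aggr2 (failed, raid_dp, partial) (block checksums)",
]
for header in examples:
    print(parse_aggr_header(header)[0], wafliron_can_run(header))
```

State combinations not covered by the examples should be treated as unknown rather than trusted to this heuristic.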
How do you run wafliron on a non-root aggregate or traditional volume?
To start wafliron on aggregates other than the root:
storage1> priv set advanced
storage1*> aggr wafliron start [aggr_name | volname]
Note: After running the command above, the storage system console may become unresponsive for a period of time. The storage system should be monitored for at least thirty minutes following the start of the wafliron. If the console is still unresponsive after this time, NetApp Technical Support should be contacted.
Can wafliron be run on a root aggregate/volume?
Wafliron can be run on a root aggregate/volume. However, it cannot be done with the storage system booted. This limitation is due to several factors such as:
- If the WAFL file system for a root aggregate/volume on a storage system is inconsistent, the storage system will be unable to boot.
- If the root aggregate/volume is not inconsistent and wafliron is started, wafliron would need to unmount the root aggregate/volume to perform its baseline checks. Since the root aggregate/volume must be online and available for the storage system to be operational, wafliron would be unable to do this.
Because of these factors, wafliron can only be started on a root aggregate/volume from the Special Boot Menu.
WARNING: If wafliron needs to be run on an aggregate containing the FlexVol root volume or on a traditional root volume, downtime must be scheduled for the storage system. However, this downtime can be minimized by running wafliron from the Special Boot Menu. When wafliron is run from the Special Boot Menu, it will perform some preliminary checks and corrections and then automatically boot the storage system. Once the storage system is booted, data will be available in the affected volumes while the wafliron continues to complete its checks and make any necessary changes.
To run wafliron on a root aggregate/traditional volume, the storage system must first be booted to the Special Boot Menu using the following steps:
- Reboot or boot the storage system.
- During the boot process, when prompted to "Press CTRL-C for Special Boot Menu" press CTRL-C. A five-item menu appears.
- At the "(1-5)" prompt, enter the hidden command wafliron.
WARNING: Prior to Data ONTAP 7.3, the above steps will initiate a wafliron on all aggregates and FlexVol volumes. This will cause the storage system to initiate the first phase of the wafliron and then boot Data ONTAP. Note that the filer will boot significantly slower when performing this task. Once Data ONTAP boots, wafliron will be running on all volumes.
For Data ONTAP 7.3 and later, if wafliron is started from the Special Boot Menu, it will only check the root aggregate. All other aggregates can only be checked using wafliron from within Data ONTAP.
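The version-dependent behavior described above can be written as a small lookup, which may be useful in a runbook script. This is an illustrative Python sketch; the function name boot_menu_wafliron_scope is hypothetical, and the version parsing assumes release strings that begin with dotted numbers (suffixes such as patch letters are ignored).

```python
import re


def boot_menu_wafliron_scope(version):
    """Describe what a Special Boot Menu wafliron checks for a given release.

    Only the leading dotted numeric part of the version string is compared,
    so "7.2.4P5D6" is treated as 7.2.4 (assumption about the string format).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    parts = tuple(int(p) for p in match.group(0).split("."))
    if parts >= (7, 3):
        return "root aggregate only"
    return "all aggregates and FlexVol volumes"


for release in ("7.2.3", "7.3", "7.3.1", "8.0"):
    print(release, "->", boot_menu_wafliron_scope(release))
```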
Can wafliron be run on a deduplicated (SIS) volume?
WARNING: NetApp Technical Support must be contacted prior to running wafliron on an aggregate containing deduplicated FlexVol volumes.
Before attempting wafliron, the storage system must be net-booted to Data ONTAP version 7.2.4P5D6 as this version includes critical fixes for wafliron when run against deduplicated volumes.
Can wafliron be run on a volume used by SnapMirror or SnapVault?
Wafliron can be run on a volume used by SnapMirror(R) or SnapVault(R). However, some limitations apply depending on the SnapMirror/SnapVault configuration.
- If the volume is the source for a Volume SnapMirror or contains source qtrees for a Qtree SnapMirror or SnapVault:
- Since the source of a SnapMirror or SnapVault is read/write, wafliron can be run using the same command as used on a regular aggregate:
storage1> priv set advanced
storage1*> aggr wafliron start [aggr_name | volname]
WARNING FOR AGGREGATES CONTAINING SNAPMIRROR SOURCE VOLUMES: After wafliron completes (or is aborted) on an aggregate, a metafile scan will occur on the FlexVol volumes as a background process. All SnapMirror functions will be delayed until after the scan(s) are complete for the FlexVol volumes that were ironed. The time to completion grows linearly with number of inodes / blocks used, and number of Snapshot copies in the volume. This means that the time can be on the order of hours or days before SnapMirror replications can resume for the FlexVol volumes that were ironed. Progress of the scan can be monitored with 'wafl scan status'.
- If the volume is a Volume SnapMirror destination or contains destination qtrees for SnapVault or Qtree SnapMirror:
- Because wafliron makes necessary changes to an inconsistent file system as it runs, it cannot be run against a read-only volume. Thus, the SnapMirror destination volume must be writable before wafliron can be run. In order to make the volume/qtrees writeable when running wafliron, the "-f" option must be used. This option enables wafliron to be started on a volume containing Qtree SnapMirror/SnapVault destination volumes by breaking all SnapMirror/SnapVault relationships to the destination qtrees.
- The command is:
storage1> priv set advanced
storage1*> aggr wafliron start [aggr_name | volname] -f
WARNING: Once wafliron is complete, the SnapMirror and SnapVault relationships will be in a broken-off state. In order to resume updates, the relationships must be resynchronized or re-initialized. Depending on the number of changes made by wafliron, it may not be possible to resynchronize the SnapMirror/SnapVault relationship. Additionally, not all versions of Open Systems SnapVault (OSSV) will support resync.
- For more information on resynchronizing a SnapMirror or SnapVault relationship, refer to the Data ONTAP 8.1 Data Protection Online Backup and Recovery Guide for 7-Mode.
- For more information on resynchronizing an OSSV relationship, refer to the Open Systems SnapVault 3.0.1 Release Notes and OSSV 3.0.1 Installation and Administration Guide.
WARNING: Running wafliron with Optional Commit on a SnapMirror/SnapVault destination will result in the SnapMirror/SnapVault relationships being automatically broken if wafliron changes are committed.
- After wafliron is run on a destination volume for Volume SnapMirror, a "block type initialization" scan must be performed on the traditional/FlexVol volume that was checked and modified by wafliron. Until this scanner completes, volume SnapMirror relationships cannot be re-synchronized, updated, or initialized. This behavior is being tracked as BUG 142586, which is first fixed in Data ONTAP 7.0.6, 7.1.2, and 7.2.2. The "block type initialization" scan may take several days to complete depending on the size of the FlexVol volume and the load on the storage appliance. To check the status of the scan, run the wafl scan status command in the advanced privilege mode:
storage1> priv set advanced
storage1*> wafl scan status
Volume sm_dest:
Scan id Type of scan progress
1 block type initialization snap 0, inode 58059 of 30454809
Can wafliron be run on a SnapLock aggregate/volume?
Wafliron can be run on both SnapLock(R) Compliant and SnapLock Enterprise volumes and aggregates. However, SnapLock Compliant volumes have some restrictions that may prevent wafliron from functioning properly. NetApp Technical Support should always be consulted before starting wafliron on a SnapLock Compliant aggregate/volume.
Can wafliron be run on a 64-bit aggregate?
Data ONTAP 8.0 7-Mode includes a new aggregate type called 64-bit aggregates. If a 64-bit aggregate is marked inconsistent, wafliron and wafliron with optional commit can be run on the aggregate and all associated FlexVol volumes. Contact NetApp Support for assistance before starting any file system checks.
Can wafliron be run on a striped aggregate?
Yes. Wafliron may be run on either a member aggregate which is part of a striped file system, or it may also be run on the aggregate which contains the MDV (Meta-Data Volume). Contact NetApp Support for assistance before starting any file system checks.
WARNING: WAFL_check is not to be run on any member aggregates within a striped file system. Only wafliron is to be used.
How is wafliron run with Optional Commit (IOC)?
For more information regarding this functionality, contact NetApp Technical Support.
WARNING: Wafliron run with Optional Commit will not permit access during the file system checks to the aggregate or any of its associated volumes being checked using optional commit mode. Other aggregates/volumes not undergoing a wafliron will be accessible.
WARNING: Running wafliron with Optional Commit on a SnapMirror/SnapVault destination will result in the SnapMirror/SnapVault relationships being automatically broken if wafliron changes are committed.
Can you prioritize which volumes wafliron checks first?
Starting in Data ONTAP 7.3, you can choose the priority of the FlexVol volumes on the aggregate being checked by wafliron. Volumes selected to be checked first will be mounted before the other volumes. This feature allows critical volumes to be available for data access before other less critical volumes.
To set priority, the '-v' flag can be used. The complete syntax is:
storage1> priv set advanced
storage1*> aggr wafliron start-v
If multiple FlexVol volumes are specified, they are checked in order. If a FlexVol volume on the aggregate is not listed, then it will be checked after all FlexVol volumes specified in the command are checked.
WARNING: Several exceptions apply to FlexVol volume prioritization:
How can you check the status of a wafliron?
During Phase 1, it is not currently possible to check the status of wafliron. Status can be seen in Phase 2 and Phase 3. Data ONTAP 7.2.3 added enhancements to check the status during these phases of wafliron.
For Data ONTAP 7.2.3:
To check the status of wafliron during Phase 2 or Phase 3, the advanced level command aggr wafliron status -s can be used. It will produce output similar to the following:
storage1> priv set advanced
storage1*> aggr wafliron status -s
Total mount phase of aggr aggr1 took 5s.
Rootdir mount phase of aggr aggr1 took 3 msecs.
Activemap mount phase of aggr aggr1 took 199 msecs.
Snapshot mount phase of aggr aggr1 took 2567 msecs.
Refcnt mount phase of aggr aggr1 took 2526 msecs.
Metadir mount phase of aggr aggr1 took 18 msecs.
Flex vols mount phase of aggr aggr1 took 75 msecs.
Wafliron scan stats for volume: flexvol3
24 files done
7 directories done
598773 inodes done
45092 blocks done
Wafliron scan stats for volume: flexvol1
44 files done
14 directories done
199685 inodes done
11282 blocks done
wafliron is active on aggregate: aggr1
Scanning (16% done).
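If this output is collected periodically, the per-volume counters and the overall percentage can be extracted for trending. The sketch below is a minimal Python parser written against the sample output above only; since wafliron output is subject to change, the line formats it expects are an assumption, and the function name parse_wafliron_status is hypothetical. If several aggregates are being ironed, it keeps only the last percentage seen.

```python
import re

VOLUME_RE = re.compile(r"^Wafliron scan stats for volume: (\S+)$")
COUNT_RE = re.compile(r"^(\d+) (files|directories|inodes|blocks) done$")
PERCENT_RE = re.compile(r"^Scanning \((\d+)% done\)\.$")


def parse_wafliron_status(output):
    """Return ({volume: {counter: value}}, percent_done or None)."""
    stats = {}
    percent = None
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        match = VOLUME_RE.match(line)
        if match:
            current = match.group(1)
            stats[current] = {}
            continue
        match = COUNT_RE.match(line)
        if match and current is not None:
            stats[current][match.group(2)] = int(match.group(1))
            continue
        match = PERCENT_RE.match(line)
        if match:
            percent = int(match.group(1))
    return stats, percent


sample = """\
Wafliron scan stats for volume: flexvol3
24 files done
7 directories done
598773 inodes done
45092 blocks done
Wafliron scan stats for volume: flexvol1
44 files done
14 directories done
199685 inodes done
11282 blocks done
wafliron is active on aggregate: aggr1
Scanning (16% done).
"""

volume_stats, percent_done = parse_wafliron_status(sample)
print(volume_stats["flexvol3"]["inodes"], percent_done)
```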
Prior to Data ONTAP 7.2.3:
To check the status of the wafliron on the aggregate or traditional volume:
storage1> priv set advanced
storage1*> aggr wafliron status [aggr_name | volname]
Example:
storage1*> aggr wafliron status
wafliron is active on volume: vol1
Scanning (2% done).
wafliron is active on aggregate: aggr0
Scanning (0% done).
By default, wafliron information is logged to the storage system's console as well as the /etc/messages file. The messages logged include wafliron start time, changes made, a summary of the changes, and the completion time for the aggregate and all FlexVol volumes.
To check the progress of a wafliron on a FlexVol volume residing on the aggregate being ironed:
storage1> priv set advanced
storage1*> wafl scan status volname
Example:
storage1*> wafl scan status vol1
Volume vol1:
Scan id Type of scan progress
158 wafliron demand 156003 (156597/156595) of 3640875
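The progress column of wafl scan status ends in an "N of M" pair, which can be converted to a rough completion fraction. The following Python sketch assumes that N and M are a current position and a total, which this article does not state explicitly, so the result should be read as an estimate only. The function name scan_fraction is hypothetical.

```python
import re

# Captures the number before an optional parenthesized group and the
# total after "of", anchored at the end of the line.
PROGRESS_RE = re.compile(r"(\d+)(?:\s+\([^)]*\))?\s+of\s+(\d+)\s*$")


def scan_fraction(line):
    """Return done/total from a wafl scan status line, or None if absent."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return None
    return done / total


lines = [
    "158 wafliron demand 156003 (156597/156595) of 3640875",
    "1 block type initialization snap 0, inode 58059 of 30454809",
]
for line in lines:
    print("%.2f%%" % (100 * scan_fraction(line)))
```

Header lines such as "Scan id Type of scan progress" contain the word "of" but no trailing number, so they yield None.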
Once the wafliron is complete, the storage system should be returned to normal administrative mode using the following command:
storage1*> priv set admin
Can wafliron be stopped?
WARNING: Wafliron should never be stopped unless instructed to do so by NetApp Technical Support.
It is possible to stop wafliron during its Phase 2 checks.
WARNING: If wafliron is stopped during Phase 2, it must be restarted at the beginning. It cannot be resumed from the point at which it was stopped.
To stop a wafliron on an aggregate or traditional volume:
storage1> priv set advanced
storage1*> aggr wafliron stop [aggr_name | volname]
Wafliron cannot be stopped during Phase 3.
What happens if the storage system is power cycled or rebooted while wafliron is running?
In all cases, wafliron will need to be started using the aggr wafliron start command. However, the point at which the wafliron starts following the power cycle depends on what phase the wafliron was in at the time of the shutdown:
- If wafliron is in Phase 1 (mounting) and is interrupted, the wafliron will start from scratch.
- If wafliron is in Phase 2 - 3, any changes committed to the point of shutdown will be retained, but wafliron will start again in Phase 1 following the aggr wafliron start command.
How does wafl scan speed affect wafliron?
The rate at which WAFL will perform scans such as wafliron is governed by the WAFL scan speed. By default, this speed is dynamically set (with a value of 0). Prior to Data ONTAP 7.0.5, the default is 2000. The speed is automatically tuned by Data ONTAP based on available system resources. It can also be set manually using an advanced command.
WARNING: Manually increasing the WAFL scan speed value from the default will allow the scanners to run quicker, but it may cause a negative performance impact to the storage system as more system resources will be required by the WAFL scanners.
This value should never be changed except under the direction of NetApp Technical Support.
Will wafliron cause any performance impact?
Once Phase 1 completes, data will be served while wafliron is running on any volume, including the root volume. However, the performance penalty of running wafliron varies depending on many factors such as:
- The file system structure
- Client access patterns
- Storage system load
- Storage system platform
- Available memory
- The extent of inconsistencies in the WAFL file system
One reason for the performance penalty is that when clients access data, wafliron must first check the data before fulfilling clients' requests. This behavior ensures the clients receive consistent data and prevents the storage system from panicking should clients touch inconsistent data. If the storage system's load is heavy due to client requests, it is recommended that the administrator plan for a high performance penalty, although the actual impact may be less.
How can wafliron be run on a pre-Data ONTAP 7G storage system?
Prior to Data ONTAP 7G, only traditional volumes were available. As such, the vol wafliron start command must be used to initiate wafliron.
Why are LUNs still offline after wafliron phase 1 completes?
When an aggregate is marked inconsistent, the FlexVol volumes and LUNs will go offline until the file system is checked. If the NVFAIL option is enabled, the LUNs will not be brought online automatically when the FlexVol volume is brought online after wafliron Phase 1 checks. This is expected behavior. Once the volume is online, the storage administrator will need to bring the LUNs online manually, one at a time. NetApp highly recommends monitoring system performance using sysstat while bringing the LUNs online.
Note: While the LUNs are being brought online, sysstat may show the filer CPU pegged at 100%. This is not necessarily an indication of a problem.