What are the important changes to RAID scrub in Data ONTAP 8.3.x or later
Applies to
ONTAP 9
Answer
- Higher overall CPU and disk utilization may be observed, especially during night hours.
- A possible cause is the change to the RAID scrub schedule introduced in Data ONTAP 8.3.
- The default RAID scrub schedule changed in Data ONTAP 8.3: scrubs now run every day.
- For more information on the raid.scrub.schedule option, refer to Storage raid-options Commands.
- Note: If no specific value is defined, the default schedule applies.
- This option specifies the weekly schedule (day, time, and duration) for scrubs that are started automatically.
- On a non-AFF system, the default schedule is daily at 1 a.m. for a duration of 4 hours, except on Sunday, when it is 12 hours.
- On an AFF system, the default schedule is weekly at 1 a.m. on Sunday for a duration of 6 hours.
- By default, scrub runs for 4 hours every day, so the overall scrub runtime is higher and scans complete more frequently than in releases prior to Data ONTAP 8.3 (a rough weekly total is sketched after this list).
- Higher CPU and disk activity during this time is expected behavior.
- If this is an issue during the week, the schedule can be defined to run at specific times and for specific durations.
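To put the new default in perspective, here is a minimal Python sketch of the weekly scrub window implied by the non-AFF defaults listed above; this is illustrative arithmetic only, not output from any ONTAP command:

# Weekly scrub-time budget implied by the non-AFF default schedule
# (4 hours daily, 12 hours on Sunday), per the defaults listed above.
weekday_hours = 6 * 4   # Monday through Saturday: 4 hours each
sunday_hours = 12       # Sunday: 12 hours
print(f"Default weekly scrub window (non-AFF): {weekday_hours + sunday_hours} hours")  # 36 hours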
Example:
- Use the storage raid-options show command to check the current settings:
cluster::> storage raid-options show -name raid.scrub.schedule
Node Option Value Constraint
-------- ------------------------------------- ------------ -----------
cluster-01 raid.scrub.schedule none
cluster-02 raid.scrub.schedule none
2 entries were displayed.
- Use the storage raid-options modify command to change the schedule as required.
- NOTE: The following command schedules two weekly RAID scrubs (a parsing sketch of the schedule format follows the output below).
- The first scrub runs for 240 minutes (four hours) every Tuesday, starting at 2 a.m.
- The second scrub runs for eight hours every Saturday, starting at 10 p.m.
cluster::> storage raid-options modify -node cluster-01 -name raid.scrub.schedule 240m@tue@2,8h@sat@22
Specified scrub schedule added
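For illustration, the following Python sketch parses a schedule value like the one above into (minutes, weekday, start hour) tuples. The duration@weekday@hour format is inferred from this example, and the parser and its names are hypothetical; consult the storage raid-options man page for the authoritative syntax:

import re

# Hypothetical parser for a raid.scrub.schedule value such as
# "240m@tue@2,8h@sat@22" (duration@weekday@start-hour, comma-separated).
TOKEN = re.compile(r"(?P<dur>\d+)(?P<unit>[mh])@(?P<day>[a-z]{3})@(?P<hour>\d{1,2})")

def parse_scrub_schedule(value):
    entries = []
    for part in value.split(","):
        m = TOKEN.fullmatch(part.strip())
        if not m:
            raise ValueError("unrecognized schedule entry: %r" % part)
        minutes = int(m["dur"]) * (60 if m["unit"] == "h" else 1)
        entries.append((minutes, m["day"], int(m["hour"])))
    return entries

print(parse_scrub_schedule("240m@tue@2,8h@sat@22"))
# [(240, 'tue', 2), (480, 'sat', 22)]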
- Use the storage raid-options show command to verify the change to the schedule:
cluster::> storage raid-options show -name raid.scrub.schedule
Node Option Value Constraint
-------- ------------------------------------- ------------ -----------
cluster-01 raid.scrub.schedule 240m@tue@2 none
cluster-02 raid.scrub.schedule none
2 entries were displayed.
Verification through the Event log:
- You can also verify and search for the related messages using the event log show command.
- NOTE: ONTAP 9 requires elevating to the diagnostic privilege level with the set diag command.
- In this example, the scrub resumes at 1 a.m., as per the default schedule.
Cluster-01::> event log show -messagename raid.rg.scrub.resume
[?] Tue May 24 01:00:12 CEST [cluster: config_thread: raid.rg.scrub.resume:notice]: /aggr_ssc_dc1_ds11_b_sata_root/plex0 /rg0: resuming scrub at stripe 578657472 (89% complete)
- To check for the pausing of a scrub, search for a suspend message:
- In this example, the scrub suspends at 5 a.m. after 4 hours of runtime, as per the default schedule (an elapsed-time check follows the summary output below).
Cluster-01::> event log show -messagename raid.rg.scrub.suspend
[?] Tue May 24 05:00:01 CEST [cluster: config_thread: raid.scrub.suspended:notice]: Disk scrub suspended.
- To check the scrub summary, run:
Cluster-01::> event log show -messagename raid.rg.scrub.summary
[?] Tue May 24 05:00:01 CEST [cluster: config_thread: raid.rg.scrub.summary.lw:notice]: Scrub found 0 RAID write
signature inconsistencies in /aggr_ssc_dc1_ds11_b_sata_data_01/plex0/rg0.
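As a quick cross-check, the resume and suspend timestamps above can be subtracted to confirm the 4-hour default window. A minimal Python sketch, using the timestamps exactly as they appear in the sample output:

from datetime import datetime

# Elapsed time between the resume (01:00:12) and suspend (05:00:01)
# messages shown above; the format string matches the sample log lines.
fmt = "%a %b %d %H:%M:%S"
resume = datetime.strptime("Tue May 24 01:00:12", fmt)
suspend = datetime.strptime("Tue May 24 05:00:01", fmt)
print("Scrub ran for", suspend - resume)  # 3:59:49, about the 4-hour default window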
Additional Information
statit will show greads during this time, and the disks may show up to 100% busy. This is normal, because RAID scrubs run as background activity at the disk level (a quick throughput estimate follows the sample):
Cluster::> set advanced
Cluster::*> node run -node node_1 -command statit -b
Cluster::*> node run -node node_1 -command statit -e
...
Disk Statistics (per second)
        ut% is the percent of time the disk was busy.
        xfers is the number of data-transfer commands issued per second.
        xfers = ureads + writes + cpreads + greads + gwrites
        chain is the average number of 4K blocks per command.
        usecs is the average disk round-trip time per 4K block.

disk      ut%  xfers   ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/data_aggr1/plex0/rg0:
0a.00.4    79  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    60    0.00  ....     .
0a.00.18   84  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    67    0.00  ....     .
0a.00.10   82  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    64    0.00  ....     .
0a.00.19   87  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    71    0.00  ....     .
0a.00.12   86  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    69    0.00  ....     .
0a.00.17   90  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    75    0.00  ....     .
0a.00.16   91  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    77    0.00  ....     .
0a.00.2    91  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    76    0.00  ....     .
0a.00.3    92  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    78    0.00  ....     .
0a.00.5    94  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    82    0.00  ....     .
0a.00.6    95  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    85    0.00  ....     .
0a.00.7    95  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    87    0.00  ....     .
0a.00.13   96  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    89    0.00  ....     .
0a.00.15   96  280.18    0.00  ....     .    0.00  ....     .    0.00  ....     .  280.18  64.00    92    0.00  ....     .
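A back-of-the-envelope estimate from the sample above: with about 280 greads commands per second and an average chain of 64 4K blocks, each disk is serving roughly 70 MiB/s of sequential scrub reads. A minimal Python sketch of that arithmetic, using the sample values:

# Per-disk scrub read throughput from the statit fields above:
# xfers (commands/s) * chain (4K blocks/command) * 4 KiB per block.
xfers_per_sec = 280.18  # greads commands per second (sample value)
chain_blocks = 64       # average 4K blocks per command (sample value)
mib_per_sec = xfers_per_sec * chain_blocks * 4 / 1024
print(f"~{mib_per_sec:.0f} MiB/s of scrub reads per disk")  # ~70 MiB/s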