Triaging block replication engine performance issues:
Data ONTAP products such as SnapMirror and volume move use the block replication engine (BRE) to transfer data from the source to the destination. At times, these products might not perform as expected. This could be due to various reasons, such as configuration issues, client issues, or a bug in one of the software modules, including the transfer engine. This article looks at some of the factors that might contribute to low performance and at the tools available to triage them. While this article focuses on BRE, many of the factors mentioned here also apply to LRSE.
Factors which can affect BRE performance:
The following are some of the many factors that can affect the performance of the transfer engine:
- Pre-existing load on the system
- Client load on the system
- Number of transfers or volume moves being attempted simultaneously
- Type of transfer - initialize or update. This will correspond to the baseline or update phase of volume move, respectively.
- System configuration, including type of the storage system, type of disks, number of disks in the aggregate, or number of volumes in the aggregate
Keep in mind that these factors can apply to the source as well as the destination of the SnapMirror relationship or the volume move.
Additionally, users might ask why the network bandwidth is not saturated. This is because the transfer engine is usually not bottlenecked by the network bandwidth. The bottleneck is usually on the source storage system or the destination storage system. (If the network is the bottleneck and is getting saturated, the user can enable Network Compression on the relationship.)
The above-mentioned factors are discussed in more detail below:
Pre-existing load on the system:
The block replication engine reads blocks from the source aggregate, sends them over the network to the destination, and writes them to the destination aggregate. This requires spare CPU cycles to make reasonable progress. Since SnapMirror also uses the same transfer engine, the data from SnapMirror performance runs can be used to find the CPU utilization. For example, on a FAS6280, a single transfer consumes 25% of the available CPU on the source and the destination, and eight simultaneous transfers can consume 50% of the CPU on the source and 55% of the CPU on the destination. Hence, the pre-existing load on the system can have a significant impact on the progress of the clients of BRE, such as volume move and SnapMirror. It is recommended that the controller CPU be kept below 50% utilization before performing SnapMirror transfers or volume moves. For volume move best practices, see TR-4075, DataMotion for Volumes Best Practices and Optimization.
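The 50% guideline above can be turned into a simple pre-check. The following is a minimal sketch, assuming CPU utilization samples have already been collected by some other means; the function name and the sample values are hypothetical and are not part of Data ONTAP.

```python
from statistics import mean

# Ceiling recommended in the text above; treat it as a guideline, not a hard limit.
CPU_CEILING_PCT = 50.0


def has_replication_headroom(cpu_samples_pct, ceiling_pct=CPU_CEILING_PCT):
    """Return True if the average of the CPU samples is below the ceiling."""
    if not cpu_samples_pct:
        raise ValueError("need at least one CPU utilization sample")
    return mean(cpu_samples_pct) < ceiling_pct


# Hypothetical utilization readings, in percent.
samples = [38.0, 42.5, 47.0, 44.0]
print(has_replication_headroom(samples))
```

Averaging hides short spikes, so a stricter check could also look at the maximum sample.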
Data ONTAP internal load:
Data ONTAP internal traffic that is not directly related to client traffic can also degrade the performance of the transfer engine (TE).
The following includes some examples of this:
- Scanner traffic: Internal scanner traffic can generate system load and cause CPU load to increase, thereby reducing the amount of CPU available for the transfer engine. It can also cause the transfer engine's throttling mechanism to kick in, further reducing the performance.
- System Kahuna usage: Other system processes will not be able to run if the storage system is spending significant amounts of time in Kahuna. Thus, even if the overall CPU utilization is low but Kahuna usage is high, the TE workload might not get enough CPU cycles to run. This will reduce the TE throughput that can be achieved on that system.
Client load on the system:
SnapMirror and volume move are considered background jobs. The underlying transfer engine backs off in the presence of client workloads. The transfer engine maintains a node-scoped pool of read and write tokens. These tokens limit the number of simultaneous messages that can be sent to WAFL by the transfer engine (across all transfers). The system keeps track of the number of ops/s sent to WAFL and the average wait time for a message before it is processed by WAFL. (This covers all operations sent to WAFL, excluding those sent by the TE itself.) If these two values cross predefined thresholds, the transfer engine reduces the number of available tokens, effectively reducing the throughput available to its clients (both SnapMirror and volume move). For example, in clustered Data ONTAP 8.2 on a FAS6280, eight simultaneous transfers can achieve an aggregate throughput of 1134 MB/s on an unloaded system. This comes down to 289 MB/s when there are 60K client IOPS on the source node, to 206 MB/s when the IOPS go up to 75K, and to 76 MB/s when the IOPS go up to 120K. These IOPS levels correspond to a CPU utilization of 66.5%, 75.8%, and 86.5% without any SnapMirror traffic, and 75.6%, 81.3%, and 86.5% with SnapMirror traffic, respectively. Therefore, client load on the system can have a significant impact on the throughput achieved by SnapMirror and volume move.
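To put the figures above in proportion, the following short sketch computes how much of the unloaded throughput remains at each client load level. The throughput values are taken from the paragraph above; the function name is only illustrative.

```python
# Aggregate throughput for eight transfers on an unloaded FAS6280 (from the text).
UNLOADED_MB_S = 1134

# Client IOPS on the source node -> aggregate replication throughput in MB/s (from the text).
LOADED_MB_S = {60_000: 289, 75_000: 206, 120_000: 76}


def retained_pct(throughput_mb_s, baseline_mb_s=UNLOADED_MB_S):
    """Percentage of the unloaded throughput that is still achieved."""
    return round(100.0 * throughput_mb_s / baseline_mb_s, 1)


for iops, mb_s in LOADED_MB_S.items():
    print(f"{iops} client IOPS: {mb_s} MB/s ({retained_pct(mb_s)}% of unloaded)")
```

At 60K client IOPS, replication retains roughly a quarter of its unloaded throughput, and at 120K IOPS less than a tenth.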
A side effect of this throttling mechanism is that the transfer engine backs off even in the presence of internal non-replication WAFL traffic. This is especially true when there is heavy scanner activity on the node. This can cause the throughput of both SnapMirror and the volume move to go down.
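The token-based backoff described above can be illustrated with a toy model. This is a sketch only: the pool size, the thresholds, and the halve-then-recover policy are all hypothetical assumptions chosen for illustration, and they do not reflect the actual values or algorithm inside the transfer engine.

```python
from dataclasses import dataclass


@dataclass
class TokenPool:
    """Toy model of a node-scoped token pool; every value here is hypothetical."""
    max_tokens: int = 64              # assumed pool size
    min_tokens: int = 4               # assumed floor so transfers never stall completely
    available: int = 64
    ops_threshold: float = 50_000.0   # assumed WAFL ops/s threshold
    wait_threshold_ms: float = 5.0    # assumed average message wait threshold

    def adjust(self, wafl_ops_per_s: float, avg_wait_ms: float) -> int:
        """Shrink the pool when both metrics cross their thresholds, else recover slowly."""
        overloaded = (wafl_ops_per_s > self.ops_threshold
                      and avg_wait_ms > self.wait_threshold_ms)
        if overloaded:
            self.available = max(self.min_tokens, self.available // 2)
        else:
            self.available = min(self.max_tokens, self.available + 4)
        return self.available


pool = TokenPool()
print(pool.adjust(60_000, 8.0))   # heavy non-replication load: fewer tokens
print(pool.adjust(1_000, 1.0))    # load gone: tokens recover gradually
```

Because the inputs are node-wide WAFL statistics, any non-replication traffic, including scanner activity, shrinks the pool in this model, which mirrors the side effect described above.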
Number of simultaneous transfers:
The aggregate throughput achieved depends on the number of simultaneous transfers. For the transfer engine, the throughput for a single transfer on a FAS6280 (large files) is 546 MB/s in Data ONTAP 8.2. This peaks at about 926 MB/s for eight simultaneous transfers. However, when eight SnapMirror transfers (or volume moves) are started at the same time, each individually achieves a throughput of only ~118 MB/s. Thus, starting a large number of transfers simultaneously might yield a high aggregate throughput, but the throughput of each individual transfer will be lower than when performing just one transfer at a time.
SnapMirror is another client of the transfer engine, just like volume move. Running volume moves along with SnapMirror on the same node can slow down individual transfers, and the effect will be more pronounced if volume move and SnapMirror are using volumes on the same aggregate. The degradation will be similar to that described above.
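The trade-off between aggregate and per-transfer throughput can be made concrete with a rough estimate. The sketch below uses the FAS6280 figures quoted in this section and assumes that the aggregate throughput is split evenly across transfers (which gives about 116 MB/s per transfer, close to the ~118 MB/s quoted above), that 1 GB is 1024 MB, and that all volumes are the same size. The function names are illustrative only.

```python
SINGLE_MB_S = 546        # one transfer on a FAS6280, Data ONTAP 8.2 (from the text)
AGGREGATE_8_MB_S = 926   # eight simultaneous transfers (from the text)


def hours(size_gb, rate_mb_s):
    """Time in hours to move size_gb at rate_mb_s, assuming 1 GB = 1024 MB."""
    return size_gb * 1024 / rate_mb_s / 3600


def compare(volume_gb, count=8):
    """Compare one-at-a-time transfers with 'count' simultaneous transfers."""
    per_transfer_mb_s = AGGREGATE_8_MB_S / count  # assumes an even split
    one_volume_h = hours(volume_gb, SINGLE_MB_S)
    return {
        "serial_first_done_h": one_volume_h,
        "serial_all_done_h": count * one_volume_h,
        "parallel_all_done_h": hours(volume_gb, per_transfer_mb_s),
    }


for name, value in compare(1024).items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, running the transfers in parallel finishes the whole batch sooner, while running them one at a time finishes the first volume sooner. Which matters more depends on whether a particular volume is urgent.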
Phase/type of the transfer:
SnapMirror has two phases: initialize and update. Similarly, volume move performs a baseline transfer, followed by multiple update transfers. Baseline transfers usually have a higher transfer rate than update transfers. For example, on a FAS6280, a single-stream baseline transfer can achieve a transfer rate of 546 MB/s, whereas the update transfer can achieve a transfer rate of 275 MB/s. Hence, it is normal (and expected) to see the volume move throughput drop after the baseline transfer is completed.
System configuration:
System configuration can have a significant impact on the throughput achieved by volume move. The following are some of the factors to consider:
- Number of disks in an aggregate: if the aggregate does not have sufficient disks, read and write access to that aggregate will be slowed down.
- A large number of volumes on the same aggregate can slow down access to a particular volume. This can occur even if that particular volume is not seeing a high change rate, but the other volumes in the same aggregate are.
- A bad or a failed disk in the aggregate can slow down access to the whole aggregate.
- Response time of SATA disks is more than that of SAS disks. In some cases, SATA disk response time can be twice the response time of SAS disks. This can have a significant impact on the performance of a volume move.
- Client traffic on the node, even if not directed to the volume in question, can cause the transfer engine to back off. This is because the internal backoff algorithm monitors statistics at a node level (since there is just one filesystem on the node), and not at a volume or aggregate level.
- An aged or a fragmented file system can increase the time it takes to read or write from the file system. This can be observed from statit, by looking at the read-chain-length and write-chain-length. Fragmented file systems will generally have smaller read and write chain lengths.
- WAN vs. LAN. Replicating data over a WAN can yield lower throughput than replicating data over a LAN. Some of the reasons are longer round trip delays, a higher packet loss rate, and the reduced bandwidth available on a WAN. For example, in Data ONTAP 8.2, eight concurrent SnapMirror initialize transfers on FAS6280s over an OC-12 link (622 Mbps) with a 50 ms round trip delay achieve an aggregate throughput of 73 MB/s. The corresponding throughput on a LAN is 926 MB/s.
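The WAN example in the list above can be checked with simple arithmetic. The sketch below converts the quoted OC-12 link rate to megabytes per second and computes the bandwidth-delay product for the quoted round trip delay. It assumes decimal units (1 Mbps = 1,000,000 bits per second) and ignores protocol overhead; the function names are illustrative.

```python
def link_capacity_mb_s(link_mbps):
    """Convert a link rate in megabits per second to megabytes per second."""
    return link_mbps / 8


def bandwidth_delay_product_mb(link_mbps, rtt_ms):
    """Data that must be in flight to keep the link full, in megabytes."""
    return link_capacity_mb_s(link_mbps) * rtt_ms / 1000


oc12_mb_s = link_capacity_mb_s(622)              # link rate from the text
in_flight_mb = bandwidth_delay_product_mb(622, 50)
utilization_pct = round(100 * 73 / oc12_mb_s, 1)  # 73 MB/s achieved, from the text

print(f"OC-12 capacity: {oc12_mb_s} MB/s")
print(f"Bandwidth-delay product at 50 ms: {in_flight_mb} MB")
print(f"Link utilization at 73 MB/s: {utilization_pct}%")
```

By this estimate, 73 MB/s is close to the raw capacity of the link, so in this WAN case the network, rather than the storage system, is the limiting factor.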
The Performance Lab results are obtained under 'ideal' conditions, using SAS disks. There are no disk bottlenecks in this setup. Other than in the tests that specifically measure the impact of client traffic, there is no client traffic on the node when the performance runs are done. Therefore, the performance of a particular setup might not match the throughput seen in the lab.
Improving BRE performance:
The following are ways to improve the performance of the TE, and of BRE in particular:
- Schedule the transfers at times of low client load and low internal scanner traffic. This will prevent the TE's internal backoff mechanism from activating. However, the desired RPO has to be kept in mind.
- Size the system correctly. There should be enough CPU headroom to allow the replication workload to execute (ideally 50% CPU headroom). Also keep in mind the maximum throughput that the platform can support: the throughput needed to replicate the changed data within the time available for replication should be less than this maximum. As discussed in the section on system configuration, the system has to be configured properly to achieve the maximum possible throughput. Details on setup can be found in SnapMirror Configuration and Best Practices Guide for Clustered Data ONTAP 8.2.
- Attempt defragmenting the file system if it is old or fragmented.
Note: This can lead to increased scanner traffic, and might actually reduce the TE throughput while the scanner is running.
- Slow down the scanners if they are causing high Kahuna usage or causing the TE to back off.
Note: This should be performed only under NGS supervision and as a temporary measure to alleviate a particular issue.
- Temporarily disable the TE's throttling mechanism. Run the following commands to perform this:
Note: This should be performed only under NGS supervision, and as a temporary measure to alleviate a particular issue.
On both the source and destination nodes, run the following command:
node run local -command 'priv set diag; setflag repl_throttle_enable 0; printflag repl_throttle_enable'
Make sure that the output shows 'repl_throttle_enable = 0x0'.
To re-enable throttling:
node run local -command 'priv set diag; setflag repl_throttle_enable 1; printflag repl_throttle_enable'
Make sure that the output shows 'repl_throttle_enable = 0x1'.
- Another way to increase the transfer engine performance is to limit the client throughput. This can be done, for example, for critical replication workloads if the client throughput is causing replication to back off excessively. The steps to limit the client throughput are listed in 'clustered Data ONTAP 8.2 System Administration Guide for Cluster Administrators', in the section on 'Managing System Performance using Storage QoS'.
- If the replication traffic is going over a WAN, you can enable Network Compression for those SnapMirror policies. Network Compression helps achieve better throughput when the network is the bottleneck. This feature is expected to be available from Data ONTAP 8.3 and later.
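The sizing guidance in the list above (comparing the change rate and the replication window against the platform's maximum throughput) can be expressed as a small calculation. This is a minimal sketch, assuming 1 GB is 1024 MB; the function names and the example change size and window are hypothetical, and the 275 MB/s figure is the single-stream update rate quoted earlier for a FAS6280.

```python
def required_mb_s(changed_gb, window_hours):
    """Sustained throughput needed to replicate changed_gb within window_hours."""
    if window_hours <= 0:
        raise ValueError("replication window must be positive")
    return changed_gb * 1024 / (window_hours * 3600)


def fits_in_window(changed_gb, window_hours, platform_max_mb_s):
    """True if the required throughput is below what the platform can sustain."""
    return required_mb_s(changed_gb, window_hours) < platform_max_mb_s


# Hypothetical workload: 2000 GB of changed data and a 4-hour replication window.
need = required_mb_s(2000, 4)
print(f"Required: {need:.1f} MB/s")
print("Fits at 275 MB/s:", fits_in_window(2000, 4, 275))
```

The platform maximum used here should be the throughput expected under the real client load, which can be far below the unloaded lab figure.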
Data collection and debugging:
To find the root cause of volume move slowness, obtain perfstat data from the system. Perfstat can be run from an external source in the following format:
perfstat8 --verbose --time 5 --nodes=<node names>
This collects the statistics and saves them to a local file. Run the perfstat --help command for additional details.
The statit section of perfstat contains information about the following:
- CPU utilization and Kahuna CPU utilization.
- Disk utilization, disk response time, and disk read and write chain lengths.
The 'repl stopwatch counters' section contains the following histograms about the transfer engine performance:
- writer_throttled: the number of times the writer has been throttled, and for how long.
- physdiff_throttled: the number of times the sender has been throttled, and for how long. This is the backoff inside the transfer engine. It does not include any network throttling.
- buffer_read_wait_token: the amount of time the sender waited for a physdiff token before it could send a read message to WAFL.
- buffer_wafl_read_blocks: the amount of time it takes to read a set of blocks (16 or 32, depending on the platform) on the sender.
The data and counters from perfstat can indicate whether the system is behaving as expected (in spite of being slow) or whether there is some other issue with the system.
To obtain some of the relevant counters at any point in time, run the following statistics show commands:
statistics show -object repl_stopwatches -instance *throttled -raw
statistics show -object repl_stopwatches -instance buffer_read_wait_tokens -raw
statistics show -object repl_stopwatches -instance writer_data_waiting -raw
statistics show -object repl_stopwatches -instance buffer_wafl_read_blocks -raw
These counters show, respectively, if the system is being throttled, the histogram of read tokens on the source, the histogram of write tokens on the destination, and the histogram of container file read latencies on the source.
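Because the -raw option reports cumulative values, a single sample says little about what is happening now. One way to use these counters is to take two samples some time apart and compare them. The sketch below assumes the counter values have already been extracted into dictionaries keyed by instance name; it does not parse the actual command output, and the sample numbers are hypothetical.

```python
def counter_deltas(before, after):
    """Difference between two cumulative counter samples, per instance name."""
    return {name: after[name] - before[name] for name in before if name in after}


def throttled_during_interval(before, after):
    """Return the *throttled instances whose counts increased between the samples."""
    deltas = counter_deltas(before, after)
    return {name: delta for name, delta in deltas.items()
            if name.endswith("throttled") and delta > 0}


# Hypothetical cumulative event counts taken a few minutes apart.
sample_1 = {"writer_throttled": 120, "physdiff_throttled": 40, "buffer_wafl_read_blocks": 9000}
sample_2 = {"writer_throttled": 120, "physdiff_throttled": 95, "buffer_wafl_read_blocks": 9800}

print(throttled_during_interval(sample_1, sample_2))
```

In this made-up example only the sender-side counter increased, which would suggest that the backoff occurred on the source node during the interval.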
BURT filing guidelines:
If there still seems to be a performance issue with the system, file a BURT and include the perfstat data collected above with the BURT. Run paloma_info and include the output in the BURT directory. This will help identify whether there are errors or other significant events that might be causing the slowness. Include the following additional information with the BURT:
- Cluster information: output of cluster peer show.
- The source volume name, the physical node where it resides, that node's partner physical node (if HA is configured), and the error file (complete name, like snapmirror_audit.123) from which the error/log was obtained. This will save time scanning through a large number of files on multiple nodes.
- The destination volume/aggregate name, the physical node on which this destination resides, and its partner physical node.
- If cascade, the third-leg information in a similar format as above.
- Network configuration. If there is a WAN component, then the settings of the WAN link (throughput, packet loss, delay).
- SnapMirror configuration: number of relationships, how the relationships are distributed between the nodes, fan-in/fan-out, cascading.
- Any other operations going on at that time: vol moves, ARL, SFO events.
- Any other settings: auto delete of snapshots and throttle.