Why is the reported LUN latency higher than the volume latency?
Applies to
- ONTAP 9
- R2T (Ready to Transfer)
- XFER_RDY (Transfer Ready)
Answer
It is often observed that the latency measured for iSCSI (and FCP) is significantly higher than that for the underlying volume, and the operation count at the volume level is higher than that measured on the contained LUNs.
Note: The Volume latency (WAFL latency) is only a subset of the LUN latency, so it is expected behavior to see a lower Volume latency than LUN latency. This article focuses on the scenarios where the LUN latency is significantly higher than the Volume latency.
Example:
- A comparison of the Ops and Latency at the Volume layer and the LUN layer for a client issuing 256KB reads
- The iSCSI protocol is used as the example, but the same applies to FCP as well
LUN latency
cluster1::*> statistics show -object lun -instance /vol/vol_linux/lun1 -counter read_data|avg_read_latency|read_ops -raw

Object: lun
Instance: /vol/vol_linux/lun1
Start-time: 10/26/2020 00:55:36
End-time: 10/26/2020 00:55:36
Scope: svm_linux

    Counter                                                     Value
    -------------------------------- --------------------------------
    avg_read_latency                                            215ms
    read_data                                                 85.25MB
    read_ops                                                      333
Volume latency
cluster1::*> statistics show -object volume -instance vol_linux -counter iscsi_read_data|iscsi_read_latency|iscsi_read_ops -raw

Object: volume
Instance: vol_linux
Start-time: 10/26/2020 01:00:10
End-time: 10/26/2020 01:00:10
Scope: svm_linux

    Counter                                                     Value
    -------------------------------- --------------------------------
    iscsi_read_data                                           85.25MB
    iscsi_read_latency                                        50034us
    iscsi_read_ops                                               1332
Note: The Volume Ops count is 4 times the LUN Ops count, because the LUN op size is 256KB while the Volume op size is limited to 64KB
- The major reason for this significant latency difference is that at the WAFL/Volume layer, operation size is limited to 64KB.
- If the client has to send an operation with a payload larger than 64KB, the payload has to be broken down into multiple Volume operations.
- Because of this WAFL size limit, each iSCSI session or FCP login negotiates settings that specify the amount of data that can be sent in a single PDU (Protocol Data Unit).
- The Volume latency therefore represents only the time it takes to handle a single 64KB PDU, while the LUN latency may measure the total handling time of several 64KB PDUs.
- LUN latency (iSCSI or FCP latency) is measured from when the first PDU of the command is fully received until the last PDU of the response is sent to the output queue of ONTAP.
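The relationship between LUN ops and Volume ops can be checked with a quick calculation. The following Python sketch (illustrative only; the function name and constants are not ONTAP internals) models the 64KB split and reproduces the counter values from the example above:

WAFL_MAX_OP_BYTES = 64 * 1024  # per-operation size limit at the Volume/WAFL layer

def volume_ops_for_lun_op(lun_op_bytes: int) -> int:
    """Number of 64KB Volume operations needed for one LUN-level operation."""
    # Ceiling division: a 256KB LUN op becomes four 64KB Volume ops.
    return -(-lun_op_bytes // WAFL_MAX_OP_BYTES)

lun_op_size = 256 * 1024  # the 256KB client read from the example
lun_ops = 333             # read_ops from the LUN counter output above

per_lun_op = volume_ops_for_lun_op(lun_op_size)            # -> 4
print(f"Volume ops per LUN op: {per_lun_op}")
print(f"Expected Volume ops:   {lun_ops * per_lun_op}")    # -> 1332, matching iscsi_read_ops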
ONTAP 9
- ONTAP 9 has an optimization for this serialized handling, for both Reads and Writes:
- Up to 16 parallel 64KB PDUs can be handled at the same time, which means that in most cases the LUN latency should not be impacted by the network round trip
- PDUs can run in parallel, but this does not mean that all PDUs always run in parallel. For the same 256KB iSCSI write example (see the sketch after this list):
  - The minimum LUN latency equals the handling time of one 64KB PDU (the Volume latency)
  - The maximum LUN latency equals 4 * the 64KB PDU handling time (Volume latency) + 3 network round trips
  - The actual LUN latency falls in the range between this minimum and maximum
- For some older versions of ONTAP 9, SAN Writes could NOT be run in parallel
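To make the minimum/maximum bounds concrete, here is a small Python sketch of the serialized 256KB case. The 50ms per-PDU handling time is taken from the Volume counter output above (50034us), while the 1ms round-trip time is an assumed value for illustration:

volume_latency_ms = 50.0  # handling time of one 64KB PDU (Volume latency, from the example)
rtt_ms = 1.0              # assumed network round trip per R2T/XFER_RDY exchange
pdus = 4                  # a 256KB operation split into four 64KB PDUs

# Best case: all PDUs are handled fully in parallel, no extra round trips.
min_lun_latency = volume_latency_ms                               # 50.0 ms

# Worst case: PDUs are handled one at a time, with a round trip between PDUs.
max_lun_latency = pdus * volume_latency_ms + (pdus - 1) * rtt_ms  # 203.0 ms

print(f"Minimum LUN latency: {min_lun_latency:.1f} ms")
print(f"Maximum LUN latency: {max_lun_latency:.1f} ms")

With the example numbers, the reported LUN latency (215ms) sits near this serialized upper bound, which is consistent with the PDUs not having run in parallel.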
With all of this in mind, for ONTAP 9 there are two major scenarios where the LUN latency is significantly higher than the Volume latency:
- PDUs are not run in parallel
  - This will not cause a performance impact as long as the Volume latency is low
- The network or the client is slow, so it takes a long time for the R2T/XFER_RDY to be acknowledged and for the next PDU to arrive
  - This will cause a performance impact even if the Volume latency is low
Notes:
- In both scenarios, this should NOT be viewed as a storage performance issue as long as the Volume latency is still low
- For FCP writes, the Network delay can also occur when the operation size is less than or equal to 64KB. This is because FCP clients send out the Write command first, then wait for the XFER_RDY from ONTAP before sending the data, so network delay can play a role in this case as well
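The FCP write note above can be illustrated with a short timeline sketch in Python. All numbers here are assumptions chosen only to show where the round trip lands inside the measured LUN latency:

rtt_ms = 2.0              # assumed host <-> ONTAP network round trip
volume_latency_ms = 1.0   # assumed WAFL handling time for a <= 64KB payload

# Measurement starts when the Write command is received by ONTAP.
# ONTAP replies with XFER_RDY and must wait one round trip for the data
# to arrive before it can be committed, so the RTT is counted inside
# the LUN latency even though the payload fits in a single 64KB PDU.
lun_write_latency = rtt_ms + volume_latency_ms

print(f"LUN write latency: {lun_write_latency:.1f} ms "
      f"({rtt_ms:.1f} ms network wait + {volume_latency_ms:.1f} ms Volume handling)")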
How to distinguish between these two scenarios?
The qos statistics workload latency show or qos statistics volume latency show command can be used to distinguish between these two scenarios:
Example:
cluster1::*> qos statistics workload latency show -workload vol_linux-wid22068
Workload            ID    Latency    Network    Cluster       Data       Disk        QoS      NVRAM      Cloud
--------------- ------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
Vol_linux-wid22068   -      210ms        8ms        1ms      181ms       18ms        0ms        2ms        0ms
- PDUs are not run in parallel
  - No significant latency is from Network
- The network or the client is slow, so it takes a long time for the R2T/XFER_RDY to be acknowledged and for the next PDU to arrive
  - The majority of the high latency is from Network (see the sketch after the Note below)
Note:
- In the QoS latency breakdown, Network stands for the latency introduced by external components, such as the network or the clients
- As long as there is no high latency from Network, there is no need to be concerned about the LUN latency, even if it is significantly higher than the underlying Volume latency
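As a rough illustration of how the QoS breakdown can be read, the Python sketch below applies the two rules above to the example output. The 50% threshold and the function name are arbitrary choices for illustration, not ONTAP defaults:

def classify(total_ms: float, network_ms: float) -> str:
    """Guess which scenario explains a high LUN latency from the QoS breakdown."""
    # If Network dominates the total, the external network/client is slow.
    if network_ms >= 0.5 * total_ms:
        return "Network/client is slow (R2T/XFER_RDY acknowledged slowly)"
    return "PDUs are not run in parallel (no significant Network latency)"

# Values from the example output above: Latency 210ms, Network 8ms.
print(classify(total_ms=210.0, network_ms=8.0))
# -> "PDUs are not run in parallel (no significant Network latency)"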
How to address high LUN latency?
Please see the article How to address network latency in a SAN environment - Resolution Guide.
Additional Information
Why is the reported LUN latency higher than the volume latency in Data ONTAP 7-Mode?