Why does packet loss impact performance?
Applies to
- All NetApp products
- TCP Communication
- CIFS, NFS, and iSCSI
Answer
- Packet loss can degrade performance in several ways.
- The goal of this article is to describe how packet loss typically causes performance issues, not the reason(s) the loss occurs.
- When packet loss is detected, TCP congestion control algorithms limit the amount of unacknowledged data in flight on the network to prevent further loss.
- The limit set by the algorithm is the congestion window (cwnd). In a lossy network, the average congestion window indicates how much data is typically sent before packet loss occurs.
- The Bandwidth Delay Product (BDP) relates this congestion window to the round trip time, giving an average expected throughput (roughly the average congestion window divided by the round trip time).
- When a packet is lost, the receiver's cumulative acknowledgment cannot advance past the missing data until the lost packet is retransmitted, causing delays that could last as long as half a second.
- To see which packets Wireshark has flagged as a retransmission:
tcp.analysis.retransmission
- Some systems (such as ONTAP) use the TCP Selective Acknowledgment (SACK) option, which can be used to identify how many packets are missing at a time.
- This Wireshark filter will let you see those packets:
tcp.options.sack.count > 0
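The relationship between the average congestion window, the round trip time, and expected throughput can be sketched as follows. This is a minimal illustration, not a measurement tool: the function name and the window and RTT values are hypothetical, and it assumes the sender can have at most one congestion window of data in flight per round trip.

```python
def estimated_throughput_bytes_per_sec(avg_cwnd_bytes: float, rtt_seconds: float) -> float:
    """Approximate throughput as one congestion window delivered per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return avg_cwnd_bytes / rtt_seconds


# Hypothetical values: a 1 ms round trip, with loss holding the average
# congestion window at 64 KiB instead of 1 MiB.
healthy = estimated_throughput_bytes_per_sec(1024 * 1024, 0.001)
lossy = estimated_throughput_bytes_per_sec(64 * 1024, 0.001)
print(f"healthy: {healthy / 1e6:.1f} MB/s, lossy: {lossy / 1e6:.1f} MB/s")
```

With the same round trip time, a smaller average window translates directly into proportionally lower throughput, which is why repeated loss is visible as a performance problem.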
Additional Information
- How to determine packet loss and the possible reasons it could be occurring
Definitions:
Bandwidth Delay Product (BDP)
- The product of a data link's capacity (in bits per second) and its round-trip delay time (in seconds).
- The result, an amount of data measured in bits (or bytes), is equivalent to the maximum amount of data on the network circuit at any given time, i.e., data that has been transmitted but not yet acknowledged.
- The bandwidth-delay product can be estimated by dividing the port's link speed (in bits per second) by 10, a rough conversion to bytes per second, and multiplying by the round trip time under load across the switch, typically on the order of 1 millisecond: 40 Gbps / 10 ~= 4 GB/sec; 4 GB/sec * 0.001 sec ~= 4 MB of buffer memory.
- The round trip time includes not only the propagation delay of the wires and the switch latency, but also any buffering within the switch, the host, or the storage system while exchanging traffic.
- A switch that switches between different link speeds should provide buffer memory in this range on the participating ports.
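The buffer estimate above can be reproduced with a short calculation. This is a sketch under the article's own rule of thumb: the function name and the `bits_per_byte_on_wire` parameter are hypothetical, and the divisor of 10 is the approximation used in the text, not an exact conversion.

```python
def bdp_bytes(link_bits_per_sec: float, rtt_seconds: float,
              bits_per_byte_on_wire: float = 10.0) -> float:
    """Estimate the bandwidth-delay product in bytes.

    Dividing by 10 (rather than 8) is the rough bits-to-bytes rule of thumb
    used in the text above.
    """
    bytes_per_sec = link_bits_per_sec / bits_per_byte_on_wire
    return bytes_per_sec * rtt_seconds


# The example from the text: a 40 Gbps port with a 1 ms round trip time under load.
estimate = bdp_bytes(40e9, 0.001)
print(f"{estimate / 1e6:.1f} MB of buffer memory")
```

Changing `rtt_seconds` shows how quickly the required buffer grows when latency under load rises above 1 millisecond.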
Receive window
- The throughput of a communication is limited by two windows: the congestion window and the receive window.
- The congestion window tries not to exceed the capacity of the network (congestion control); the receive window tries not to exceed the capacity of the receiver to process data (flow control).
- The receiver may be overwhelmed by data if, for example, it is very busy (such as a Web server).
- Each TCP segment contains the current value of the receive window.
- If, for example, a sender receives an ACK that acknowledges byte 4000 and specifies a receive window of 10000 (bytes), the sender will not send packets after byte 14000, even if the congestion window allows it.
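The byte 4000 / window 10000 example can be expressed as a small calculation. This is an illustrative sketch only: the function names are hypothetical, and real TCP stacks track considerably more state than shown here.

```python
def highest_sendable_byte(acked_byte: int, receive_window: int) -> int:
    """Last byte the sender may transmit given the latest ACK and advertised window."""
    return acked_byte + receive_window


def usable_window(cwnd: int, rwnd: int, bytes_in_flight: int) -> int:
    """Bytes the sender may still send: the smaller of the two windows,
    minus data already sent but not yet acknowledged."""
    return max(0, min(cwnd, rwnd) - bytes_in_flight)


# The example from the text: ACK for byte 4000, receive window of 10000 bytes.
print(highest_sendable_byte(4000, 10000))

# A larger congestion window does not help; the receive window is the limit.
print(usable_window(cwnd=20000, rwnd=10000, bytes_in_flight=2000))
```

Whichever window is smaller governs how much unacknowledged data the sender keeps in flight.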
Congestion window
- In TCP, the congestion window is one of the factors that determines the number of bytes that can be sent out at any time.
- The congestion window is maintained by the sender.
- Note that this is not to be confused with the sliding window size, which is maintained by the receiver.
- The congestion window is a means of stopping a link between the sender and the receiver from becoming overloaded with too much traffic.
- It is calculated by estimating how much congestion there is on the link.
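To illustrate how loss shrinks the congestion window, here is a toy simulation of additive-increase/multiplicative-decrease (AIMD), a simplified textbook scheme. It is an assumption for illustration only and is not claimed to be the algorithm used by ONTAP or any particular TCP stack; the function name, segment size, and starting window are hypothetical.

```python
def simulate_aimd(loss_rounds, rounds, mss=1460, initial_cwnd=14600):
    """Return the congestion window (bytes) at the start of each round trip.

    Simplified AIMD: grow by one segment per loss-free round trip,
    halve the window (never below one segment) when a loss is detected.
    """
    cwnd = initial_cwnd
    history = []
    for r in range(rounds):
        history.append(cwnd)
        if r in loss_rounds:
            cwnd = max(mss, cwnd // 2)  # multiplicative decrease on loss
        else:
            cwnd += mss                 # additive increase otherwise
    return history


# Hypothetical comparison: no loss versus a loss every fifth round trip.
clean = simulate_aimd(loss_rounds=set(), rounds=20)
lossy = simulate_aimd(loss_rounds={4, 9, 14, 19}, rounds=20)
print("average cwnd without loss:", sum(clean) // len(clean))
print("average cwnd with loss:   ", sum(lossy) // len(lossy))
```

Even occasional loss keeps the average window well below what the loss-free sender reaches, which is the effect described by the average congestion window earlier in this article.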
Round trip time (RTT)
- This is the amount of time needed for the sender to send bytes, the receiver to acknowledge them, and the sender to receive the acknowledgement.
- Typically expressed in milliseconds (ms).
In ONTAP 9.1 and below (including Data ONTAP 8), or 9.5 and above, the netstat command output includes a retransmit column:
- In ONTAP 9.1 and below, it is called Retransmits.
- In ONTAP 9.5 and above, it is called Rexmit.
- It can be useful to check for incrementing retransmits here as well (this may be faster than creating a trace, installing Wireshark, and viewing the capture).