Purpose of Ethernet Flow Control:
Ethernet Flow Control is a mechanism that allows a network device that is experiencing a performance bottleneck to request that the neighboring device stop transmitting. It defines a type of Ethernet frame usually referred to as a 'PAUSE' frame. PAUSE frames can only be exchanged between devices that are directly connected, and are defined only in the Ethernet (Data Link) layer. Therefore, Ethernet PAUSE frames are not associated with, and cannot be associated with, IP, TCP, NFS, CIFS, or any other higher-level protocol.
Without Ethernet Flow Control, if the receiving device cannot process information as fast as the transmitting device is sending it, the receiving device eventually has to start discarding incoming data. This forces the transmitting Client to retransmit, which can cause painful delays.
With Ethernet Flow Control, the recipient can request that the transmitting device 'pause' for a brief time, in an attempt to avoid these discards. In some cases, this allows for a more efficient network transfer. Ethernet Flow Control, however, is not always effective at preventing performance issues. In certain (usually rare) circumstances, Ethernet Flow Control can cause more of a performance impact than would have been caused by the retransmissions it was designed to prevent.
Note: Ethernet Flow Control is common on equipment that runs at 1 Gbps or faster. It may also be found on more modern 100 Mbps equipment, even though the Ethernet Flow Control specification was not intended to be supported there.
Assume that we have a basic network with a Storage Controller, a switch, and a Client machine. Further, assume that the Storage Controller is configured to SEND Ethernet Flow Control packets, and the switch is configured to RECEIVE (listen to) Ethernet Flow Control packets on the interface that the Storage Controller is plugged into. Finally, assume that the Client is not configured to support Ethernet Flow Control at all. Note that the switch has a setting for Ethernet Flow Control for each interface (one setting for the interface that the Storage Controller is plugged into, and a potentially different setting for the interface that the Client is plugged into).
In this scenario, the Storage Controller can now send PAUSE frames to the switch. If the Storage Controller reaches a point at which it cannot process information as fast as the switch is sending it, the Storage Controller is now allowed to send a PAUSE frame (Xoff) to the switch. The switch, when it receives that PAUSE frame, is expected to stop transmitting. The switch should now start to hold (queue) traffic that needs to be sent to the Storage Controller for a brief period of time. The Storage Controller can now process the data that it had previously received.
Each PAUSE/Xoff frame contains a value (called a 'Quanta') that requests the neighboring device to stop transmitting for a specific period of time. The transmitter (the switch, in this example) is expected to resume transmitting when this time has expired, or when it receives a different Ethernet Flow Control packet from the receiver (the Storage Controller, in this example). Based on the IEEE standards, the switch is expected to evaluate the QUANTA value, and stay 'paused' for a time determined by the QUANTA value and the speed of the interface. The amount of time that an interface actually stays 'paused', however, depends on the firmware of the transmitting device and may not follow this rule.
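As an illustration of where the pause value sits in a frame, the following Python sketch builds and decodes a PAUSE frame. The field values used here (destination address 01-80-C2-00-00-01, EtherType 0x8808, opcode 0x0001) come from the IEEE 802.3 MAC Control definition rather than from this article. The function names and the source address are hypothetical, and the frame check sequence that the hardware appends is left out.

```python
import struct

# Assumed constants, taken from the IEEE 802.3 MAC Control definition.
PAUSE_DST_MAC = bytes.fromhex("0180c2000001")  # reserved multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
MIN_FRAME_LEN = 60  # minimum Ethernet frame length without the 4-byte FCS


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Return a PAUSE frame (without FCS) that requests `quanta` pause units."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in a two-byte unsigned integer")
    fields = struct.pack("!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, quanta)
    frame = PAUSE_DST_MAC + src_mac + fields
    return frame.ljust(MIN_FRAME_LEN, b"\x00")  # zero-pad to the minimum size


def parse_pause_quanta(frame: bytes):
    """Return the quanta value, or None if the frame is not a PAUSE frame."""
    if len(frame) < 18:
        return None
    ethertype, opcode, quanta = struct.unpack_from("!HHH", frame, 12)
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PAUSE_OPCODE:
        return None
    return quanta


# Hypothetical locally administered source address, for illustration only.
example_src = bytes.fromhex("020000000001")
xoff = build_pause_frame(example_src, 65535)  # longest possible pause
xon = build_pause_frame(example_src, 0)       # zero quanta: resume at once
print(len(xoff), parse_pause_quanta(xoff), parse_pause_quanta(xon))
```

A quanta of 0 asks the neighbor to resume immediately, which is how an Xon is commonly expressed on the wire.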
Calculating the amount of time an interface is expected to remain Paused:
PAUSE frames, or Xoff frames, appear in the output of the ifstat -a -v command. At 10Gbps, the ifstat -a command might list 'PAUSE' frames; at 1Gbps, they are typically listed as Xoff and Xon.
A PAUSE frame includes the period of pause time being requested, in the form of a two byte unsigned integer (0 through 65535) known as a 'quanta'. This number is the requested duration of the pause.
- Each 'quanta' is equal to 512 bit times.
- A 'bit time' is defined as the number of seconds it takes to send a bit of data. Since an interface is rated in 'bits per second', the 'bit time' is the inverse of the interface line speed.
For a 1Gbps (1,000,000,000 bps) interface, the 'bit time' is 1/1,000,000,000 seconds per bit
For a 1Gbps link, the maximum amount of time an interface is expected to stop transmitting is:
maximum quanta value * 512 / (10^9) = 65535*512/1000000000 = 0.03355392 seconds = 33.55ms.
For 10Gbps, the maximum amount of time an interface is expected to stop transmitting is:
65535*512/10^10 = 3.355ms
A 100Mbps interface that supports the protocol would be expected to PAUSE for up to 65535*512/10^8 = 335.5ms.
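The arithmetic above can be reproduced with a short Python sketch. The function and constant names are hypothetical; the only inputs are the 512 bit times per quanta and the line speeds already stated in the text.

```python
MAX_QUANTA = 65535       # largest value of the two-byte pause field
BIT_TIMES_PER_QUANTA = 512


def pause_seconds(quanta: int, link_bps: int) -> float:
    """Return the requested pause duration in seconds for a given line speed."""
    if not 0 <= quanta <= MAX_QUANTA:
        raise ValueError("quanta must be between 0 and 65535")
    # One bit time is 1 / link_bps seconds, so the pause is
    # quanta * 512 bit times divided by the line speed.
    return quanta * BIT_TIMES_PER_QUANTA / link_bps


for label, bps in (("100Mbps", 10**8), ("1Gbps", 10**9), ("10Gbps", 10**10)):
    print(f"{label}: up to {pause_seconds(MAX_QUANTA, bps) * 1000:.3f} ms")
```

As noted earlier, this is the duration the standard asks for; the time a given device actually stays paused may differ.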
Devices or network elements can be configured to both send and receive PAUSE frames. On a 1Gbps interface, if the Storage Controller lists an Xoff (or PAUSE) in the 'Transmit' statistics, the Storage Controller is instructing the switch port to PAUSE, and the Storage Controller should stop receiving packets for up to 33.5ms.
If the Storage Controller lists an Xoff (or PAUSE) in the 'Receive' statistics, the switch is instructing the Storage Controller to stop sending packets, and the Storage Controller should stop transmitting on that interface for up to 33.5ms.
If the 'paused' port receives an Xon packet before the timer expires, the port can immediately resume transmitting.
Calculating the potential impact:
-- interface e0b (0 hours, 2 minutes, 21 seconds) --
Frames/second: 4655 | Bytes/second: 416k | Errors/minute: 0
Discards/minute: 0 | Total frames: 186k | Total bytes: 31954k
Total errors: 0 | Total discards: 0 | Multi/broadcast: 0
No buffers: 0 | Non-primary u/c: 0 | Tag drop: 0
Vlan tag drop: 0 | Vlan untag drop: 0 | CRC errors: 0
Runt frames: 0 | Fragment: 0 | Long frames: 0
Jabber: 0 | Alignment errors: 0 | Bus overruns: 0
Queue overflows: 0 | Xon: 1326 | Xoff: 1326
Jumbo: 0 | Reset: 0 | Reset1: 0
Reset2: 0 | TBI mode: 0 | Pad odd: 0
Pad even: 0
Frames/second: 2993 | Bytes/second: 4009k | Errors/minute: 0
Discards/minute: 0 | Total frames: 224k | Total bytes: 285m
Total errors: 0 | Total discards: 0 | Multi/broadcast: 2
Queue overflows: 0 | No buffers: 0 | Frames queued: 0
Buffer coalesces: 0 | MTUs too big: 0 | Max collisions: 0
Single collision: 0 | Multi collisions: 0 | Late collisions: 0
Timeout: 0 | Xon: 0 | Xoff: 0
Current state: up | Up to downs: 0 | Auto: on
Status interrupt: 0 | Speed: 1000m | Duplex: full
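Counter output in this 'name: value | name: value' layout can be read programmatically. The following Python sketch is one way to do that. The function names are hypothetical, the sample is a three-line excerpt of the output above, and the parsing rules are inferred from this one listing rather than from any documented format.

```python
import re
from collections import defaultdict


def parse_counters(text: str) -> dict:
    """Collect 'name: value' pairs from pipe-separated counter lines.

    A counter name can occur more than once (Xoff appears in two blocks),
    so each name maps to a list of values in order of appearance.
    """
    counters = defaultdict(list)
    for line in text.splitlines():
        if line.lstrip().startswith("--"):
            continue  # skip the '-- interface ... --' header line
        for field in line.split("|"):
            name, sep, value = field.partition(":")
            if sep:
                counters[name.strip()].append(value.strip())
    return dict(counters)


def parse_interval_seconds(header: str) -> int:
    """Read '(H hours, M minutes, S seconds)' from the header line."""
    match = re.search(r"\((\d+) hours?, (\d+) minutes?, (\d+) seconds?\)", header)
    if match is None:
        raise ValueError("no interval found in header")
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


sample = """\
-- interface e0b (0 hours, 2 minutes, 21 seconds) --
Queue overflows: 0 | Xon: 1326 | Xoff: 1326
Timeout: 0 | Xon: 0 | Xoff: 0
"""
counters = parse_counters(sample)
interval = parse_interval_seconds(sample.splitlines()[0])
print(counters["Xoff"], interval)
```

Values are kept as strings because some counters in the listing use suffixes such as 'k' and 'm'.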
The statistics collected cover a total of 141 seconds (2 minutes, 21 seconds).
- 1326 Xoff frames were received (Xoff: 1326 in the first block of counters).
- Each Xoff frame might have stopped the transmission for up to 33.5ms.
- Each Xon would release the storage system port to resume transmitting. However, it is not possible to determine whether the Xon was received 0.1ms after the Xoff, 1ms after the Xoff, or 33.4ms after the Xoff.
Therefore, the maximum *potential* amount of time for which all transmissions from the storage system to the network could have been held is:
1326 * 33.5ms = 44,421ms = 44.4 seconds.
When considered as a ‘percentage of packets’, this might appear trivial: 1326/224000 = 0.6%
However, note that the PAUSE frames received only have an effect on the frames transmitted (by temporarily halting transmissions that the recipient would have otherwise sent). The port can continue to receive from the switch as fast as the switch chooses to send them. A more accurate percentage calculation would have been <packets sent>/<packets that would have been sent if traffic was not PAUSED>. However, there is no way of determining the number of packets that would have been sent if traffic was not PAUSED.
When the amount of time that the port might have been prevented from transmitting is considered instead, the perspective is very different, and potentially more relevant:
44.4 seconds / 141 seconds = 31%
Therefore, for as much as 31% of the time during this 141 second interval, clients would have been unable to receive responses from this port on the storage system.
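The upper-bound estimate above can be sketched in Python as follows. The function name is hypothetical, and the inputs are the figures already used in this example: 1326 Xoff frames, a maximum pause of 33.5ms per frame on a 1Gbps link, and a 141 second collection interval.

```python
def max_paused_time(xoff_count: int, max_pause_s: float, interval_s: float):
    """Return (upper bound on paused seconds, fraction of the interval).

    This is a worst case: it assumes every Xoff held the port for the full
    pause time. The true figure can be anywhere between zero and this bound.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # The port cannot have been paused for longer than the interval itself.
    upper_s = min(xoff_count * max_pause_s, interval_s)
    return upper_s, upper_s / interval_s


upper_s, fraction = max_paused_time(1326, 0.0335, 141)
print(f"paused for at most {upper_s:.1f} s ({fraction:.1%} of the interval)")
```

With these inputs the bound comes to about 44.4 seconds, a little under a third of the interval.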
Of course, it is also possible that each ‘Xoff’ was immediately followed by an ‘Xon’, in which case the transmit traffic would have been stopped for only a very brief period of time. It is likely that this would have remained unnoticed.
Therefore, in the case of these statistics, the impact of 1326 PAUSE frames would range somewhere between ‘none’ and 44.4 seconds. Because the Flow Control PAUSE frames are aimed at, and acted on by, the individual port, the PAUSE frame is not passed to the upper protocol layers. PKTT will not contain the PAUSE frames. Most port-mirror traces from the switch will also not contain any of the PAUSE frames. Only an ‘on the wire’ packet capture would reveal, with certainty, the true impact.
The PAUSE frames are an indication that the device sending the PAUSE frames is experiencing issues. If the calculations confirm that the maximum amount of time a port might have been paused is a significant portion of the time over which the data was captured, then consider investigating why the device sending the PAUSE frames is experiencing difficulty.
If the maximum amount of time that the PAUSE frames might have stopped traffic flow is a small portion of the time during which the data was captured, the presence of the PAUSE frames should not be of much concern. However, the presence of PAUSE frames still indicates that the device sending them is experiencing some level of difficulty.
The original purpose of the protocol was to allow the interface to PAUSE its neighbor (the interface on the other end of the cable) when there was a possibility that the neighbor was transmitting fast enough to overrun the interface. If you PAUSE the neighboring interface first, then you prevent the receiving interface from getting overwhelmed. While this can sometimes help the situation, it can also mean that the receiving interface (and the node that it is contained within) will not register any evidence of the issue other than the PAUSE frames. Therefore, if it is necessary to understand why the PAUSE frames are being sent, you may have to disable Ethernet Flow Control. Then, if the issue continues, you may see other counters or behaviors that tell you if the overwhelmed component is in the NIC, the PCI bus, or the Operating System.