NetApp Knowledgebase

Troubleshooting LACP port channel/interface groups

Applies to

Network Switches

Answer

What is a Port Channel Group?

A port channel group is a set of physical Ethernet ports aggregated together for increased aggregate throughput, improved network resiliency, or both. It is also known as an EtherChannel, trunk, port bundle, or link aggregation group (LAG), and is often loosely called 'LACP' after the protocol that can negotiate it. The Institute of Electrical and Electronics Engineers (IEEE) defined the standard for port channel groups as 802.3ad, which was later moved to 802.1AX. While 'port channel group' commonly refers to the switch-side configuration, NetApp uses the term 'interface group' (ifgrp) or the legacy name Virtual Interface (VIF). Interface groups on the NetApp side come in three flavors.

  • Single Mode: No switch-side configuration, active/passive only, and not a port channel group.
  • Multimode: A 'static' port channel group. The storage controller and switch are hard-coded with a set number of ports that will always be members of the port channel group. It is less optimal than Multimode_LACP because neither the storage controller nor the switch can prevent an individual port from participating in the port channel group unless the port goes offline (example: no cable plugged in).

    If a device configured with this type of port channel group is cabled incorrectly, the 'static' multimode port channel group is unaware of the mis-cabling and uses all ports for transmit. The switch should then report CAM table flapping ('MAC flapping'), and network connectivity, if established, will be very inconsistent.

    Note: A switch administrator might refer to a static port channel group as LACP, but a 'static' multimode ifgrp or port channel group does not use the LACP protocol, and the two terms cannot be used interchangeably.

  • Multimode_LACP: A port channel group mode that allows two network devices (for example, a switch and a NetApp storage controller) to communicate and compare port state and parameters. Because LACP is an exchange between two directly connected ports (for example, a NIC port on a server or client and the switch port it is plugged into), LACP is able to confirm that communications between the two devices are successful. It allows either participant to decide whether or not each physical port in the port channel group should be used. This is superior to 'static' multimode port channel groups only in that it can detect some conditions that are not associated with an unplugged or totally failed port. ONTAP requires that each port channel have only one SystemID, so a port channel cannot span multiple switches that present different SystemIDs.

    SystemIDs: Each member port in the LAG sends a SystemID in its LACPDUs. All the member ports on the filer should send the same SystemID to indicate that only one logical device is connected to the switch. Similarly, each of the switch's member ports should send the same SystemID. However, the SystemID is expected to differ between the switch and the filer.

    If the member ports don't send the same SystemID, then one side is trying to aggregate two different logical devices, which is not supported with LACP. For example, a Cisco Nexus cross-switch link aggregation group may send two different SystemIDs if a virtual port channel (VPC) isn't configured on the switches. Configuring the link aggregation group with a VPC tells each switch to send the same SystemID.
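
The SystemID rule can be sketched as a small consistency check. The following Python is a minimal illustration, not a NetApp or switch tool: the function name, the input mapping, and the sample SystemID values are all hypothetical, and in practice the IDs would have to be read from the LACP information reported by the controller and the switch.

```python
def check_system_ids(members):
    """Return a list of problems found in a LAG's LACP SystemIDs.

    members maps a port name to a (local_system_id, partner_system_id)
    pair, as seen in the LACPDUs sent and received on that port.
    """
    local_ids = {local for local, _ in members.values()}
    partner_ids = {partner for _, partner in members.values()}
    problems = []
    # Every member port on one side must advertise the same SystemID.
    if len(local_ids) > 1:
        problems.append("local ports send different SystemIDs: %s" % sorted(local_ids))
    if len(partner_ids) > 1:
        problems.append("partner ports send different SystemIDs: %s" % sorted(partner_ids))
    return problems


# Hypothetical sample data; the IDs are made up for illustration.
GOOD_LAG = {
    "e7a": ("02:a0:98:00:00:01", "00:23:04:ee:be:01"),
    "e13a": ("02:a0:98:00:00:01", "00:23:04:ee:be:01"),
}
# Two switches without a shared (VPC) SystemID: the partner IDs differ.
SPLIT_LAG = {
    "e7a": ("02:a0:98:00:00:01", "00:23:04:ee:be:01"),
    "e13a": ("02:a0:98:00:00:01", "00:23:04:ee:be:02"),
}

for name, lag in (("GOOD_LAG", GOOD_LAG), ("SPLIT_LAG", SPLIT_LAG)):
    print(name, check_system_ids(lag) or "SystemIDs are consistent")
```

The check deliberately does not compare the local ID with the partner ID, because the two sides are expected to differ.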

    There are two 'modes' of operation in an LACP port:
  • Active LACP ports always participate in LACP negotiations. If the interface in the neighboring device (the other end of the cable) is not configured for LACP use (active or passive), then this port will be disabled.
  • Passive LACP ports are allowed to participate in LACP negotiations, if the device at the other end of the cable initiates by sending an LACP control packet.
  • All NetApp storage controllers always operate in 'Active LACP' mode. This is non-configurable.
    LACP will be operating in one of two 'timings' at any given time:
  • 'Slow timers' exchange LACP control packets once every 30 seconds.
  • 'Fast timers' exchange LACP control packets once every 1 second.
  • It is expected that most devices will use 'Fast timers' when a port is initially brought 'up'. It is also expected that LACP will move to 'Slow timers' soon after LACP has properly negotiated, and the ports come up for use.
    LACP can disable a member of the port channel under several circumstances.
  • If the network device stops receiving updates from the other device, the port will eventually be disabled. With 'Slow timers' this takes up to 90 seconds (three missed 30-second updates); with 'Fast timers' it takes far less.
  • If the network device receives updates from the neighboring device, but that update includes incorrect information, the network device can disable the port. This will be expected to detect mis-cabled systems, and can occur in far less than 90 seconds.
  • If the network device receives an update that indicates that the neighboring port is not useable, the port should be disabled immediately upon receiving that LACP control packet.
  • In some cases, a switch set to 'Passive' will not negotiate correctly with the storage controller, even though the storage controller is set to 'Active' by default.
  • In some cases, a switch might not move to 'Slow timers' when the storage controller moves. This can prevent LACP from bringing the ports into service.
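
The timer and timeout behavior described above can be illustrated with a few lines of arithmetic. This is a minimal sketch, not ONTAP or switch code: the names are hypothetical, and the factor of three between the update interval and the timeout is an assumption taken from the IEEE standard's defaults (3 seconds for fast timers, 90 seconds for slow timers), which matches the 90-second figure given above.

```python
FAST_PERIOD_S = 1    # 'Fast timers': one LACP control packet per second
SLOW_PERIOD_S = 30   # 'Slow timers': one LACP control packet per 30 seconds
TIMEOUT_FACTOR = 3   # assumption from the IEEE defaults: timeout = 3 x period


def lacp_timeout_s(period_s):
    """Seconds of silence after which the neighbor is considered lost."""
    return TIMEOUT_FACTOR * period_s


def partner_timed_out(seconds_since_last_update, period_s):
    """True if the last update is older than the timeout for this period."""
    return seconds_since_last_update > lacp_timeout_s(period_s)


for label, period in (("fast", FAST_PERIOD_S), ("slow", SLOW_PERIOD_S)):
    print("%s timers: partner declared lost after %d s of silence"
          % (label, lacp_timeout_s(period)))
```

The same arithmetic hints at why a timer mismatch can matter: a side that expects an update every second would give up after about 3 seconds, well before a neighbor on slow timers sends its next packet.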

    Note: LACP will only be able to confirm that two neighboring devices (example: NIC and Switch) are communicating directly with each other. Any other failures, such as routing failures or switch outages not affecting this port, will not be detected by LACP.

Who creates Port Channel Groups?

Storage architects and network administrators have to work together to ensure their port channel groups are configured correctly. If this collective validation is not done, the network infrastructure might be at risk.

Why understand and troubleshoot Port Channel Groups?

The common symptoms of mis-configured port channel groups are intermittent connectivity, packet loss, unexpected loss of redundancy, and 'flaky' network connectivity. For a robust storage infrastructure, verify that every port channel group is configured correctly on both the storage controller and the switch.

The following are the common causes for mis-configured port channel groups:

  • The network switch is not configured correctly.
  • The cables from the storage controller NICs are connected to the wrong network switch ports.
  • The wrong ports are specified in the ifgrp configuration on the storage controller.
  • There is a cabling or hardware fault, such as a bad Ethernet cable or switch port/module.
  • The switch is not configured to use only 'Slow timers' and 'Active LACP', which some environments require.

How to troubleshoot LACP port channel groups?

This section describes the procedure that storage architects should follow when troubleshooting LACP port channel groups: review the interface group status (ifgrp status) output. If the output does not show every member port up and actively aggregating, some correction is required.

For ONTAP 9.2 and later, consider using the following KB for troubleshooting assistance:
How to Troubleshoot LACP Issues with "ifconfig -v" in ONTAP 9.2+

Consider the following scenarios:

Scenario 1 - Port down

-------- IFGRP-STATUS --------
default: transmit 'IP Load balancing', Ifgrp Type 'multi_mode', fail 'log'
corp_lag1: 1 link, transmit 'IP Load balancing', Ifgrp Type 'lacp' fail 'default'
Ifgrp Status Up Addr_set
trunked: corp_ifgrp
up:
e13a: state up, since 26Feb2013 05:18:14 (4+19:01:01)
mediatype: auto-10g_sr-fd-up
flags: enabled
active aggr, aggr port: e13a
input packets 2965338456, input bytes 11151446739454
input lacp packets 13811, output lacp packets 414063
output packets 2518851712, output bytes 22373630536977
up indications 3, broken indications 0
drops (if) 0, drops (link) 0
indication: up at 26Feb2013 05:18:14
consecutive 0, transitions 3
broken:
e7a: state broken, since 26Feb2013 15:23:49 (4+08:55:26)
mediatype: auto-10g_sr-fd-down
flags: lacp enabled
input packets 0, input bytes 0
input lacp packets 1218, output lacp packets 36343
output packets 0, output bytes 0
up indications 4, broken indications 2
drops (if) 0, drops (link) 67
indication: broken at 26Feb2013 15:23:49
consecutive 0, transitions 6

This is an example of a port channel group with one working port and one port that is down. The working port displays 'active aggr', meaning it is successfully participating in a link aggregation. The 'state broken' indication on port e7a means the link is down; in this example, no cable is plugged into the port.

Scenario 2 - Port mis-cabled

corp_lag1: 1 link, transmit 'IP Load balancing', Ifgrp Type 'lacp' fail 'default'
Ifgrp Status Up Addr_set
trunked: corp_ifgrp
up:
e13a: state up, since 22Jan2013 15:07:01 (18+09:17:13)
mediatype: auto-10g_sr-fd-up
flags: enabled
active aggr, aggr port: e13a
input packets 18140836964, input bytes 52796851685561
input lacp packets 211121, output lacp packets 6332475
output packets 15346936943, output bytes 168979152263131
up indications 21, broken indications 9
drops (if) 0, drops (link) 0
indication: up at 22Jan2013 15:07:01
consecutive 0, transitions 30
lag_inactive:
e7a: state lag_inactive, since 22Jan2013 15:06:39 (18+09:17:35)
mediatype: auto-10g_sr-fd-up
flags: lacp enabled
input packets 582405, input bytes 1193956618
input lacp packets 211095, output lacp packets 6355756
output packets 19089029, output bytes 13102456660
up indications 15, broken indications 7
drops (if) 0, drops (link) 0
indication: lag_inactive at 22Jan2013 15:06:39
consecutive 0, transitions 22

The configuration is the same here; however, e7a displays lag_inactive instead of 'broken'. This indicates that the storage controller and the switch have not agreed upon something associated with LACP. Once the storage controller decides that LACP does not support using this port, the link is marked lag_inactive, indicating the link is not 'active' in the 'lag' (Link Aggregation Group). The end result is that port e7a will not be used by Data ONTAP for traffic, forcing all traffic to use port e13a. Redundancy, resiliency, and potential throughput gains are therefore no longer present. If port e13a also becomes lag_inactive (for any of the triggers mentioned earlier), this ifgrp will be moved to an inactive or offline state, and no traffic will be allowed in either direction.

In this condition, it is possible to gather clues about the cause by monitoring the 'input lacp packets' and 'output lacp packets' counters:

The 'input lacp packets' counter monitors how many LACP control/negotiation packets have been received from the switch. If it is not incrementing, the switch might not be sending LACP packets, or hardware might be discarding all packets received. If the switch is configured for a 'static' port channel group, this counter will not increment.

The 'output lacp packets' counter monitors how many LACP control/negotiation packets have been sent by the storage controller. This should always be incrementing if the port is in any state other than 'broken'.
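
The counter comparison can be sketched in Python. This is a minimal illustration under stated assumptions, not a NetApp tool: the function names and verdict strings are hypothetical, the first sample is an excerpt of the Scenario 2 output above, and the second sample is invented to show what a later reading might look like.

```python
import re

PORT_RE = re.compile(r"^\s*(\S+): state (\w+),")
LACP_RE = re.compile(r"input lacp packets (\d+), output lacp packets (\d+)")


def lacp_counters(status_text):
    """Map each port to its (input, output) LACP packet counters."""
    counters = {}
    port = None
    for line in status_text.splitlines():
        port_match = PORT_RE.match(line)
        if port_match:
            port = port_match.group(1)
            continue
        lacp_match = LACP_RE.search(line)
        if lacp_match and port is not None:
            counters[port] = (int(lacp_match.group(1)), int(lacp_match.group(2)))
    return counters


def compare_samples(earlier_text, later_text):
    """Classify each port by which LACP counters moved between two samples."""
    earlier = lacp_counters(earlier_text)
    later = lacp_counters(later_text)
    verdicts = {}
    for port in earlier:
        if port not in later:
            continue
        in_delta = later[port][0] - earlier[port][0]
        out_delta = later[port][1] - earlier[port][1]
        if out_delta <= 0:
            verdicts[port] = "no_output"  # controller not sending (port broken?)
        elif in_delta <= 0:
            verdicts[port] = "no_input"   # switch silent or packets discarded
        else:
            verdicts[port] = "ok"
    return verdicts


SAMPLE_1 = """\
e13a: state up, since 22Jan2013 15:07:01 (18+09:17:13)
input lacp packets 211121, output lacp packets 6332475
e7a: state lag_inactive, since 22Jan2013 15:06:39 (18+09:17:35)
input lacp packets 211095, output lacp packets 6355756
"""
# Hypothetical later sample: e7a receives nothing new from the switch.
SAMPLE_2 = """\
e13a: state up, since 22Jan2013 15:07:01 (18+09:17:13)
input lacp packets 211123, output lacp packets 6332535
e7a: state lag_inactive, since 22Jan2013 15:06:39 (18+09:17:35)
input lacp packets 211095, output lacp packets 6355816
"""

print(compare_samples(SAMPLE_1, SAMPLE_2))
```

Taking the two samples at least 30 seconds apart gives a neighbor on 'Slow timers' time to send at least one packet.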

Scenario 3 - Properly configured LACP ifgrp

Both ports show 'active aggr' (the 'aggr port' value can be ignored). There should be no mention of lag_inactive.

The configuration is similar to the following:

corp_lag2: 2 links, transmit 'IP Load balancing', Ifgrp Type 'lacp' fail 'default'
Ifgrp Status Up Addr_set
trunked: corp_ifgrp
up:
e7b: state up, since 26Feb2013 04:58:05 (4+19:12:27)
mediatype: auto-10g_sr-fd-up
flags: enabled
active aggr, aggr port: e13b
input packets 9493, input bytes 1177132
input lacp packets 13831, output lacp packets 415997
output packets 617618727, output bytes 2908794789546
up indications 3, broken indications 0
drops (if) 0, drops (link) 0
indication: up at 26Feb2013 04:58:05
consecutive 0, transitions 3
e13b: state up, since 26Feb2013 04:58:04 (4+19:12:28)
mediatype: auto-10g_sr-fd-up
flags: enabled
active aggr, aggr port: e13b
input packets 9494, input bytes 1177256
input lacp packets 13830, output lacp packets 414747
output packets 505651519, output bytes 4117402075645
up indications 3, broken indications 0
drops (if) 0, drops (link) 0
indication: up at 26Feb2013 04:58:04
consecutive 0, transitions 3

In this state, both ports are listed as 'enabled', and both the 'input lacp packets' and 'output lacp packets' counters should be incrementing by at least one every 30 seconds.
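
The three scenarios can be told apart mechanically. The following Python is a rough sketch, assuming the ifgrp status text has the line layout shown in this article; the function names are hypothetical, and the sample strings are shortened excerpts of the Scenario 2 and Scenario 3 outputs above.

```python
import re

PORT_RE = re.compile(r"^\s*(\S+): state (\w+),")


def port_health(status_text):
    """Map each port to its state and whether it shows 'active aggr'."""
    ports = {}
    current = None
    for line in status_text.splitlines():
        match = PORT_RE.match(line)
        if match:
            current = match.group(1)
            ports[current] = {"state": match.group(2), "active_aggr": False}
        elif current and line.strip().startswith("active aggr"):
            ports[current]["active_aggr"] = True
    return ports


def unhealthy_ports(status_text):
    """Ports that are not 'up', or are 'up' without 'active aggr'."""
    return sorted(
        name
        for name, info in port_health(status_text).items()
        if info["state"] != "up" or not info["active_aggr"]
    )


SCENARIO_2 = """\
up:
e13a: state up, since 22Jan2013 15:07:01 (18+09:17:13)
flags: enabled
active aggr, aggr port: e13a
lag_inactive:
e7a: state lag_inactive, since 22Jan2013 15:06:39 (18+09:17:35)
flags: lacp enabled
"""
SCENARIO_3 = """\
up:
e7b: state up, since 26Feb2013 04:58:05 (4+19:12:27)
flags: enabled
active aggr, aggr port: e13b
e13b: state up, since 26Feb2013 04:58:04 (4+19:12:28)
flags: enabled
active aggr, aggr port: e13b
"""

print("Scenario 2 needs attention on:", unhealthy_ports(SCENARIO_2))
print("Scenario 3 needs attention on:", unhealthy_ports(SCENARIO_3))
```

An empty result corresponds to the properly configured case; any listed port is either 'broken' or 'lag_inactive' and should be investigated.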

See Technical Report TR-3802 for best practices on the configuration of LACP port channel groups, including example syntax for Cisco Nexus switches. To validate that Layer 1 is as expected, confirm which switch ports the NetApp storage controller is connected to. The Cisco Discovery Protocol (CDP) feature of Data ONTAP can assist with this validation.

node1> options cdpd.enable on
Wait up to 60 minutes for CDP polling intervals to succeed.
node1> cdpd show-neighbors

For Clustered Data ONTAP:

Cluster01::> node run -node Node1 options cdpd.enable on
Cluster01::> node run -node Node1 cdpd show-neighbors

 

Note: For current generation Cisco implementations, consider adding the 'vpc peer-gateway' setting on the port-channel configuration to ensure maximum compatibility with NetApp Fastpath configuration (default ON in 7-Mode and CDOT). The 'vpc peer-gateway' statement is available in NX-OS 4.2.1 for the 7000 series switches and NX-OS Release 5.0(3)N1(1) for the 5500 series switches. See the KB 'Unable to send network traffic over Cisco Nexus vPC with ip.fastpath enabled' for more information.

Additional Information