e0a/e0b link flaps on A300/FAS8200, A200/FAS2600, A220/FAS2700, C190 may cause a Takeover

Last updated

Nov 25, 2024

Category: fas-systems

Specialty: HW

Applies to

AFF A300, FAS8200
AFF A200, FAS2650, FAS2620
AFF A220, AFF C190, FAS2750, FAS2720
ONTAP 9

Issue

Cluster ports e0a or e0b (or both ports) experience link flaps or go down at the same time.

Tue Oct 03 11:08:31 CEST [node1: ixgbe/e0b: snmp.link.down:info]: Interface 2 is down.
Tue Oct 03 11:08:31 CEST [node1: ixgbe/e0b: netif.linkDown:info]: Ethernet e0b: Link down, check cable.
Tue Oct 03 11:08:31 CEST [node1: ixgbe/e0a: snmp.link.down:info]: Interface 1 is down.
Tue Oct 03 11:08:31 CEST [node1: ixgbe/e0a: netif.linkDown:info]: Ethernet e0a: Link down, check cable.

Tue Oct 03 11:08:32 CEST [node2: ixgbe/e0b: snmp.link.down:info]: Interface 2 is down.
Tue Oct 03 11:08:32 CEST [node2: ixgbe/e0b: netif.linkDown:info]: Ethernet e0b: Link down, check cable.
Tue Oct 03 11:08:32 CEST [node2: ixgbe/e0a: snmp.link.down:info]: Interface 1 is down.
Tue Oct 03 11:08:32 CEST [node2: ixgbe/e0a: netif.linkDown:info]: Ethernet e0a: Link down, check cable.
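To spot this pattern in a saved copy of the log, one option is to group the netif.linkDown messages by node and timestamp. The Python below is a minimal sketch, not a NetApp tool: the function name find_simultaneous_flaps and the regular expression are assumptions modeled only on the sample messages above, and events count as simultaneous only when their timestamp strings are identical.

```python
import re
from collections import defaultdict

# Pattern modeled on the sample messages in this article (an assumption,
# not an official EMS grammar):
#   "<timestamp> [<node>: <driver>/<port>: netif.linkDown:info]"
LINK_DOWN = re.compile(
    r"(?P<ts>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \w+) "
    r"\[(?P<node>[^:\]]+): [^/\]]+/(?P<port>e0[ab]): netif\.linkDown:info\]"
)


def find_simultaneous_flaps(log_text):
    """Return {(node, timestamp): [ports]} for entries where both e0a and e0b went down."""
    seen = defaultdict(set)
    for match in LINK_DOWN.finditer(log_text):
        seen[(match["node"], match["ts"])].add(match["port"])
    # Keep only node/timestamp pairs where both cluster ports reported link down.
    return {key: sorted(ports) for key, ports in seen.items() if ports == {"e0a", "e0b"}}


# Two of the messages shown above, used as sample input.
sample = (
    "Tue Oct 03 11:08:31 CEST [node1: ixgbe/e0b: netif.linkDown:info]: "
    "Ethernet e0b: Link down, check cable.\n"
    "Tue Oct 03 11:08:31 CEST [node1: ixgbe/e0a: netif.linkDown:info]: "
    "Ethernet e0a: Link down, check cable.\n"
)

for (node, ts), ports in find_simultaneous_flaps(sample).items():
    print(f"{node}: {', '.join(ports)} down at {ts}")
```

Matching on the exact timestamp string keeps the sketch simple. A real check would likely parse the timestamps and allow a tolerance of a second or two.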

Check cluster port status and storage failover status:

cluster1::> network port show -ipspace Cluster

Node: cluster1-01
                                                  Speed(Mbps) Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper  Status
--------- ------------ ---------------- ---- ---- ----------- --------
e0a       Cluster      Cluster          down 9000 1000/-      -
e0b       Cluster      Cluster          down 9000 1000/-      -

Node: cluster1-02
                                                  Speed(Mbps) Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper  Status
--------- ------------ ---------------- ---- ---- ----------- --------
e0a       Cluster      Cluster          down 9000 1000/-      -
e0b       Cluster      Cluster          down 9000 1000/-      -
4 entries were displayed.

cluster1::> storage failover show

                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
cluster1-01    cluster1-02    false    Connected to cluster-02, Partial
                                       giveback, Takeover is not possible:
                                       The version of software running on
                                       each node of the SFO pair is
                                       incompatible, NVRAM log not synchronized
cluster1-02    cluster1-01    -        Waiting for cluster applications to
                                       come online on the local node
                                       Offline applications: mgmt, vldb,
                                       vifmgr, bcomd, crs.

If the ports do not come back up and Connectivity, Liveliness and Availability Monitor (CLAM) is enabled:

An "out of quorum" panic will occur on one of the nodes.

PANIC : Received PANIC packet from partner, receiving message is (Coredump and takeover initiated because Connectivity, Liveliness and Availability Monitor (CLAM) has determined this node is out of quorum.

The node that panics is taken over, and the surviving node serves all data.

If the ports do not come back up and Connectivity, Liveliness and Availability Monitor (CLAM) is NOT enabled:

There will not be a storage takeover and both nodes will go out of quorum. Neither node will be serving data.
See: SU436: [Impact: Critical] CLAM takeover default configuration changed
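The two cases above can be summarized as a small decision function. This is an illustrative Python sketch for a two-node HA pair whose cluster ports stay down. The names Outcome and outcome_if_ports_stay_down are hypothetical, and the function only restates the behavior described in this article. It does not query ONTAP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    panic: bool              # does one node panic as "out of quorum"?
    takeover: bool           # does a storage takeover occur?
    nodes_serving_data: int  # nodes still serving data afterwards


def outcome_if_ports_stay_down(clam_enabled: bool) -> Outcome:
    """Expected result when e0a/e0b do not come back up (two-node HA pair)."""
    if clam_enabled:
        # One node panics and is taken over; the survivor serves all data.
        return Outcome(panic=True, takeover=True, nodes_serving_data=1)
    # No takeover: both nodes go out of quorum and neither serves data.
    return Outcome(panic=False, takeover=False, nodes_serving_data=0)


for clam in (True, False):
    print(f"CLAM enabled={clam}: {outcome_if_ports_stay_down(clam)}")
```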
Similar messages appear in the EMS log:

Jun 08 12:30:09 [xxx-02:vifmgr.clus.linkdown:EMERGENCY]: The cluster port e0b on node naptp06c-02 has gone down unexpectedly.
Jun 08 12:30:10 [xxxc-02:vifmgr.clus.linkdown:EMERGENCY]: The cluster port e0a on node naptp06c-02 has gone down unexpectedly.
Jun 08 12:31:00 [xxx-02:monitor.globalStatus.critical:EMERGENCY]: Controller failover of xxx-01 is not possible: partner mailbox disks not accessible or invalid. One or more mirrored aggregates are degraded.
Jun 08 12:31:02 [xxx:callhome.clam.node.ooq:EMERGENCY]: Call home for NODE(S) OUT OF CLUSTER QUORUM.
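When reviewing a longer EMS export, it can help to count only the events relevant to this issue. The following Python is a hedged sketch: the function name count_watched_events, the WATCHED set, and the bracket pattern are assumptions derived from the sample messages above, not from an official EMS format description.

```python
import re
from collections import Counter

# Bracketed header as it appears in the samples above: [<node>:<event>:<SEVERITY>]
EMS_EVENT = re.compile(r"\[(?P<node>[^:\]]+):(?P<event>[\w.]+):(?P<severity>[A-Za-z]+)\]")

# Event names taken from this article's sample output.
WATCHED = {"vifmgr.clus.linkdown", "callhome.clam.node.ooq"}


def count_watched_events(ems_text):
    """Count occurrences of the watched EMS event names in the given text."""
    return Counter(
        match["event"]
        for match in EMS_EVENT.finditer(ems_text)
        if match["event"] in WATCHED
    )


# Sample input copied from the messages shown above.
ems_sample = (
    "Jun 08 12:30:09 [xxx-02:vifmgr.clus.linkdown:EMERGENCY]: The cluster port e0b "
    "on node naptp06c-02 has gone down unexpectedly.\n"
    "Jun 08 12:30:10 [xxxc-02:vifmgr.clus.linkdown:EMERGENCY]: The cluster port e0a "
    "on node naptp06c-02 has gone down unexpectedly.\n"
    "Jun 08 12:31:02 [xxx:callhome.clam.node.ooq:EMERGENCY]: Call home for NODE(S) "
    "OUT OF CLUSTER QUORUM.\n"
)

for event, count in sorted(count_watched_events(ems_sample).items()):
    print(f"{event}: {count}")
```

A nonzero count for both event names in the same time window is consistent with the scenario described in this article, though it does not by itself identify the cause of the link loss.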