Cluster switches N9K C9336 reboot caused loss of cluster communication

Category: ontap-9

Specialty: hw

Applies to

FAS/AFF systems
Cisco N9K-C9336C-FX2 Cluster switches
NX-OS version 10.2.5

Issue

Both cluster ports on every node go down within about a minute of each other, resulting in a loss of cluster communication:

Sat Nov 01 00:30:34 [Node-01: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3b: Link down, check cable.
Sat Nov 01 00:31:23 [Node-01: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3a: Link down, check cable.

Sat Nov 01 00:30:34 [Node-02: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3b: Link down, check cable.
Sat Nov 01 00:31:23 [Node-02: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3a: Link down, check cable.

Sat Nov 01 00:30:35 [Node-03: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3b: Link down, check cable.
Sat Nov 01 00:31:24 [Node-03: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3a: Link down, check cable.

Sat Nov 01 00:30:35 [Node-04: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3b: Link down, check cable.
Sat Nov 01 00:31:24 [Node-04: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3a: Link down, check cable.
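
When triaging a larger EMS log, it can help to confirm programmatically that every node lost both cluster ports and how far apart the two link-down events were. The following is a minimal sketch, not a NetApp tool. The function names, the `year` default, and the assumption that e3a and e3b are the cluster ports on every node are illustrative only and should be adjusted to the environment.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches EMS lines of the form shown in this article, for example:
# Sat Nov 01 00:30:34 [Node-01: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3b: Link down, check cable.
LINK_DOWN = re.compile(
    r"^(?P<stamp>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<node>[^:\]]+): [^:\]]+: netif\.linkDown:info\]: "
    r"Ethernet (?P<port>[^:\s]+): Link down"
)


def link_down_times(lines, year=2025):
    """Return {node: {port: datetime}} for every netif.linkDown line.

    The EMS timestamps in this format carry no year, so one is supplied
    by the caller (assumption). Day and month names are parsed with the
    current locale, which is assumed to be English.
    """
    events = defaultdict(dict)
    for line in lines:
        match = LINK_DOWN.match(line.strip())
        if match is None:
            continue  # not a link-down event
        when = datetime.strptime(
            f"{year} {match['stamp']}", "%Y %a %b %d %H:%M:%S"
        )
        events[match["node"]][match["port"]] = when
    return dict(events)


def isolation_windows(events, cluster_ports=("e3a", "e3b")):
    """Return {node: seconds between the first and last cluster port going down}.

    Nodes that are missing a link-down event for any cluster port are
    left out, because they were not fully cut off from the cluster network.
    """
    windows = {}
    for node, ports in events.items():
        if all(port in ports for port in cluster_ports):
            times = [ports[port] for port in cluster_ports]
            windows[node] = (max(times) - min(times)).total_seconds()
    return windows


# Two lines copied from the excerpt above.
SAMPLE = [
    "Sat Nov 01 00:30:34 [Node-01: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3b: Link down, check cable.",
    "Sat Nov 01 00:31:23 [Node-01: mgmt_port_link_status_poll: netif.linkDown:info]: Ethernet e3a: Link down, check cable.",
]
```

Calling `isolation_windows(link_down_times(SAMPLE))` on the two Node-01 lines reports a gap of 49 seconds between e3b and e3a going down. A node that appears in `link_down_times` but not in `isolation_windows` still had one cluster port without a link-down event.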

All nodes in the cluster then fall out of CLAM quorum:

Sat Nov 01 00:32:30 [Node-01: kltp: clam.node.ooq:EMERGENCY]: Node (name=Node-01, ID=1000) is out of "CLAM quorum" (reason=node in minority).
Sat Nov 01 00:32:31 [Node-02: kltp: clam.node.ooq:EMERGENCY]: Node (name=Node-01, ID=1000) is out of "CLAM quorum" (reason=node in minority).
Sat Nov 01 00:32:22 [Node-03: kltp: clam.node.ooq:EMERGENCY]: Node (name=Node-01, ID=1000) is out of "CLAM quorum" (reason=node in minority).
Sat Nov 01 00:32:31 [Node-04: kltp: clam.node.ooq:EMERGENCY]: Node (name=Node-01, ID=1000) is out of "CLAM quorum" (reason=node in minority).

The cluster's RDB becomes unsynchronized, leading to a loss of quorum.
The switch logs show that both cluster switches rebooted, and the cluster ports connected to them went down:

Cluster-switch-1:

Sat Nov 1 04:39:01 2025: Card Uptime Record
----------------------------------------------
Uptime: 83, 0 days 0 hour(s) 1 minute(s) 23 second(s)
Reset Reason: Unknown (0)

Cluster-switch-2:

Sat Nov 1 04:38:33 2025: Card Uptime Record
----------------------------------------------
Uptime: 83, 0 days 0 hour(s) 1 minute(s) 23 second(s)
Reset Reason: Unknown (0)
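
To relate a switch reboot to the node-side events, the uptime record can be turned into an approximate boot time. The sketch below is illustrative only: `parse_uptime_record` is a hypothetical helper, and it assumes that the timestamp on the "Card Uptime Record" line is the moment the uptime value was sampled, which this article does not confirm. It also checks that the leading seconds count agrees with the days/hours/minutes/seconds breakdown.

```python
import re
from datetime import datetime, timedelta

HEADER = re.compile(
    r"^(?P<stamp>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}): Card Uptime Record"
)
UPTIME = re.compile(
    r"^Uptime: (?P<total>\d+), (?P<days>\d+) days (?P<hours>\d+) hour\(s\) "
    r"(?P<minutes>\d+) minute\(s\) (?P<seconds>\d+) second\(s\)"
)


def parse_uptime_record(text):
    """Parse one Card Uptime Record block as printed in this article.

    Returns a dict with the record timestamp, the uptime as a timedelta,
    a consistency flag, and an approximate boot time (timestamp minus
    uptime). The boot time is only as good as the assumption that the
    timestamp marks when the uptime was sampled.
    """
    recorded_at = None
    uptime = None
    total = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            recorded_at = datetime.strptime(
                header["stamp"], "%a %b %d %H:%M:%S %Y"
            )
            continue
        fields = UPTIME.match(line)
        if fields:
            total = int(fields["total"])
            uptime = timedelta(
                days=int(fields["days"]),
                hours=int(fields["hours"]),
                minutes=int(fields["minutes"]),
                seconds=int(fields["seconds"]),
            )
    if recorded_at is None or uptime is None:
        raise ValueError("no complete Card Uptime Record found")
    return {
        "recorded_at": recorded_at,
        "uptime": uptime,
        # True when the leading seconds count matches the breakdown.
        "consistent": uptime.total_seconds() == total,
        "approx_boot": recorded_at - uptime,
    }


# Record copied from the first switch excerpt above.
SAMPLE_RECORD = """\
Sat Nov 1 04:39:01 2025: Card Uptime Record
----------------------------------------------
Uptime: 83, 0 days 0 hour(s) 1 minute(s) 23 second(s)
Reset Reason: Unknown (0)
"""
```

For the first switch record, the 83-second count matches 1 minute 23 seconds, and subtracting it from 04:39:01 gives an approximate boot time of 04:37:38 on the switch clock. Switch and node clocks may use different time zones, so account for any offset before comparing this value with the EMS timestamps.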