Continued packet loss when pinging from cluster LIF after cluster switch RCF upgrade

Category:: fabric-interconnect-and-management-switches

Specialty:: HW

Applies to

Cisco NX3232C Cluster Network Switch (CNS)
RCF firmware update to 1.10 or later from 1.8 or earlier

Issue

All nodes continuously report the following events when pinging each other's cluster LIFs:

[vifmgr: vifmgr.cluscheck.ctdpktloss:debug]: Continued packet loss when pinging from cluster lif node-01_clus-1 (node node-01) to cluster lif node-02_clus2 (node node-02).

[vifmgr: vifmgr.cluscheck.droppedall:alert]: Total packet loss when pinging from cluster lif node-01_clus-1 (node node-01) to cluster lif node-02_clus2 (node node-02).
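When many of these events accumulate, it can help to tally them per source and destination LIF. The following is a minimal Python sketch, assuming the events have been exported as plain text in the format shown above. The regular expression is modeled only on these two sample lines, and the function names (`parse_cluscheck_event`, `count_failing_pairs`) are hypothetical helpers, not part of any ONTAP tooling.

```python
import re

# Pattern modeled on the two sample events above; other EMS events may differ.
EVENT_RE = re.compile(
    r"\[vifmgr: (?P<event>vifmgr\.cluscheck\.\w+):(?P<severity>\w+)\]: "
    r".*? from cluster lif (?P<src_lif>\S+) \(node (?P<src_node>[^)]+)\) "
    r"to cluster lif (?P<dst_lif>\S+) \(node (?P<dst_node>[^)]+)\)"
)


def parse_cluscheck_event(line):
    """Return a dict of fields for a vifmgr.cluscheck event, or None."""
    match = EVENT_RE.search(line)
    return match.groupdict() if match else None


def count_failing_pairs(lines):
    """Count cluscheck events per (source LIF, destination LIF) pair."""
    counts = {}
    for line in lines:
        fields = parse_cluscheck_event(line)
        if fields:
            key = (fields["src_lif"], fields["dst_lif"])
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Feeding the exported log lines to `count_failing_pairs` gives a quick view of which LIF pairs are affected and how often.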

Half of the paths fail the basic connectivity check in cluster ping-cluster. Example:

::*> cluster ping-cluster -node node-01
...
Basic connectivity succeeds on 14 path(s)
Basic connectivity fails on 14 path(s)
...
Larger than PMTU communication succeeds on 14 path(s)
RPC status:
14 paths up, 0 paths down (tcp check)
14 paths up, 0 paths down (udp check)
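To check for this "half of the paths fail" signature in captured output, the counts can be pulled out with a short script. This is a sketch only, assuming the command output has been saved as text. `summarize_ping_cluster` is a hypothetical helper name, and it reads only the two "Basic connectivity" lines shown in the example.

```python
import re


def summarize_ping_cluster(output):
    """Extract basic-connectivity path counts from ping-cluster text."""
    ok = re.search(r"Basic connectivity succeeds on (\d+) path\(s\)", output)
    bad = re.search(r"Basic connectivity fails on (\d+) path\(s\)", output)
    succeeded = int(ok.group(1)) if ok else 0
    failed = int(bad.group(1)) if bad else 0
    total = succeeded + failed
    return {
        "succeeded": succeeded,
        "failed": failed,
        # Fraction of basic-connectivity paths that failed (0.0 if none found).
        "failed_fraction": failed / total if total else 0.0,
    }


sample = (
    "Basic connectivity succeeds on 14 path(s) "
    "Basic connectivity fails on 14 path(s)"
)
print(summarize_ping_cluster(sample))
# {'succeeded': 14, 'failed': 14, 'failed_fraction': 0.5}
```

A `failed_fraction` of 0.5, as in the example, matches the symptom described in this article.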

Every time a cluster LIF is reverted from a port connected to switch 1 to a port connected to switch 2:
- EMS reports messages similar to:

[vifmgr: vifmgr.dbase.checkerror:alert]: VIFMgr experienced an error verifying cluster database consistency. Some LIFs might not be hosted properly as a result.

[vifmgr: vifmgr.startup.failover.err:alert]: VIFMgr encountered errors during startup.

- vifmgr reports messages similar to:

[kern_vifmgr:info:6537] rdb::qm:...:src/rdb/quorum/qm_states/inq/SecondaryState.cc:222 (thr_id:0x80c138500) SecondaryState::receivePoll Leaving quorum at 21170636s apparent starvation or RPC failure at sender 1003. Sender expected VS_Unknown, actual WS_QuorumMember.

- mgwd reports messages similar to:

[kern_mgwd:info:2343] A [src/rdb/quorum/qm_states/inq/SecondaryState.cc 217 (0x823d60300)]: receivePoll: Leaving quorum at 9068946s apparent starvation or RPC failure at sender 1003. Sender expected VS_Unknown, actual WS_QuorumMember.

[kern_mgwd:info:2343] A [src/rdb/cluster_events.cc 88 (0x823d60300)]: Report: Cluster event: node-event, epoch 31, site 1004 [apparent starvation detected in voting protocol].

[kern_mgwd:info:2325] W [src/rdb/TM.cc 3923 (0x821377f00)]: _coord_commit: TM 1003: Transaction TID <31,277502,277502> commit failed: UNIT_OFFLINE; declaring unstable quorum in epoch 31.  Total participating sites: 3, number of sites committed: 3, epsilon commit: true

[kern_mgwd:info:2325] rdb::TM:Mon Nov 06 11:06:47 2023:src/rdb/TM.cc:3933 (thr_id:0x821377f00) TM 1003: Transaction TID <31,277502,277502> commit failed: UNIT_OFFLINE; declaring unstable quorum in epoch 31.  Total participating sites: 3, number of sites committed: 3, epsilon commit: true
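The quorum-related lines above can be picked out of the vifmgr and mgwd logs with a small filter. The following is a hedged sketch that assumes the log lines look like the samples shown. The patterns are derived only from those samples, and the field labels (`at_seconds`, `sender`, `tm_id`, `tid`, `reason`, `epoch`) are illustrative names chosen here, not ONTAP terminology.

```python
import re

# Matches the "Leaving quorum" lines seen in both vifmgr and mgwd logs.
LEAVING_RE = re.compile(
    r"Leaving quorum at (?P<at_seconds>\d+)s .*? at sender (?P<sender>\d+)\."
)

# Matches the transaction-manager "commit failed" lines seen in mgwd logs.
COMMIT_RE = re.compile(
    r"TM (?P<tm_id>\d+): Transaction TID <(?P<tid>[\d,]+)> commit failed: "
    r"(?P<reason>\w+); declaring unstable quorum in epoch (?P<epoch>\d+)"
)


def classify_rdb_line(line):
    """Return (kind, fields) for a recognized quorum message, else None."""
    match = LEAVING_RE.search(line)
    if match:
        return ("leaving_quorum", match.groupdict())
    match = COMMIT_RE.search(line)
    if match:
        return ("commit_failed", match.groupdict())
    return None
```

Lines that match neither pattern return `None`, so the function can be applied to a whole log file to keep only the quorum-related entries.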

The issue persists regardless of whether the ISL is enabled or disabled (disabling it isolates the traffic on each switch).