Continued packet loss when pinging from cluster LIF after cluster switch RCF upgrade
Applies to
- Cisco NX3232C Cluster Network Switch (CNS)
- RCF firmware update to 1.10 or later from 1.8 or earlier
Issue
- All nodes continuously report the following events when pinging each other's cluster LIFs:
[vifmgr: vifmgr.cluscheck.ctdpktloss:debug]: Continued packet loss when pinging from cluster lif node-01_clus-1 (node node-01) to cluster lif node-02_clus2 (node node-02).
[vifmgr: vifmgr.cluscheck.droppedall:alert]: Total packet loss when pinging from cluster lif node-01_clus-1 (node node-01) to cluster lif node-02_clus2 (node node-02).
- cluster ping-cluster reports that basic connectivity fails on half of the paths. Example:
::*> cluster ping-cluster -node node-01
...
Basic connectivity succeeds on 14 path(s)
Basic connectivity fails on 14 path(s)
...
Larger than PMTU communication succeeds on 14 path(s)
RPC status:
14 paths up, 0 paths down (tcp check)
14 paths up, 0 paths down (udp check)
- Every time a cluster port connected to switch 1 is reverted to a LIF for switch 2:
- EMS reports messages similar to:
[vifmgr: vifmgr.dbase.checkerror:alert]: VIFMgr experienced an error verifying cluster database consistency. Some LIFs might not be hosted properly as a result.
[vifmgr: vifmgr.startup.failover.err:alert]: VIFMgr encountered errors during startup.
- vifmgr reports messages similar to:
[kern_vifmgr:info:6537] rdb::qm:...:src/rdb/quorum/qm_states/inq/SecondaryState.cc:222 (thr_id:0x80c138500) SecondaryState::receivePoll Leaving quorum at 21170636s apparent starvation or RPC failure at sender 1003. Sender expected VS_Unknown, actual WS_QuorumMember.
- mgwd reports messages similar to:
[kern_mgwd:info:2343] A [src/rdb/quorum/qm_states/inq/SecondaryState.cc 217 (0x823d60300)]: receivePoll: Leaving quorum at 9068946s apparent starvation or RPC failure at sender 1003. Sender expected VS_Unknown, actual WS_QuorumMember.
[kern_mgwd:info:2343] A [src/rdb/cluster_events.cc 88 (0x823d60300)]: Report: Cluster event: node-event, epoch 31, site 1004 [apparent starvation detected in voting protocol].
[kern_mgwd:info:2325] W [src/rdb/TM.cc 3923 (0x821377f00)]: _coord_commit: TM 1003: Transaction TID <31,277502,277502> commit failed: UNIT_OFFLINE; declaring unstable quorum in epoch 31. Total participating sites: 3, number of sites committed: 3, epsilon commit: true
[kern_mgwd:info:2325] rdb::TM:Mon Nov 06 11:06:47 2023:src/rdb/TM.cc:3933 (thr_id:0x821377f00) TM 1003: Transaction TID <31,277502,277502> commit failed: UNIT_OFFLINE; declaring unstable quorum in epoch 31. Total participating sites: 3, number of sites committed: 3, epsilon commit: true
- The issue persists regardless of whether the ISL is enabled or disabled (disabling it isolates the traffic on each switch).
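When triaging saved EMS logs and cluster ping-cluster output, the symptoms above can be checked with a small script. The following is a minimal sketch, not a NetApp tool: the function names are hypothetical, and the regular expressions assume the event and output formats exactly as quoted in this article.

```python
import re

# Matches the vifmgr.cluscheck events quoted in this article (assumed format).
EVENT_RE = re.compile(
    r"vifmgr\.cluscheck\.(?P<event>ctdpktloss|droppedall):\w+\]: "
    r".*? from cluster lif (?P<src_lif>\S+) \(node (?P<src_node>[^)]+)\) "
    r"to cluster lif (?P<dst_lif>\S+) \(node (?P<dst_node>[^)]+)\)"
)

# Matches the "Basic connectivity ..." summary lines of cluster ping-cluster.
PATHS_RE = re.compile(r"Basic connectivity (succeeds|fails) on (\d+) path\(s\)")


def parse_cluscheck_events(log_text):
    """Return (event, source LIF, destination LIF) for each packet-loss event."""
    return [
        (m.group("event"), m.group("src_lif"), m.group("dst_lif"))
        for m in EVENT_RE.finditer(log_text)
    ]


def basic_connectivity(ping_output):
    """Return the succeeding and failing path counts from ping-cluster output."""
    counts = {"succeeds": 0, "fails": 0}
    for outcome, number in PATHS_RE.findall(ping_output):
        counts[outcome] = int(number)
    return counts


def half_paths_failing(ping_output):
    """True when as many paths fail as succeed, the pattern shown in this article."""
    counts = basic_connectivity(ping_output)
    return counts["fails"] > 0 and counts["fails"] == counts["succeeds"]


if __name__ == "__main__":
    sample = (
        "Basic connectivity succeeds on 14 path(s)\n"
        "Basic connectivity fails on 14 path(s)\n"
    )
    print(basic_connectivity(sample), half_paths_failing(sample))
```

An equal split between succeeding and failing paths is only a hint that one switch's paths are affected; the script does not identify which switch or port is involved.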
