MetroCluster: LIFs offline while performing giveback
Applies to
- MetroCluster IP
- Cluster ports on MetroCluster backend ports
- MetroCluster backend port offline
- Giveback
Issue
- Booting a node and performing a giveback while one or more MetroCluster backend ports are offline can cause the cluster to go out of quorum
- VifMgr (Virtual Interface Manager) is then taken offline, which in turn triggers FreeBSD to take all LIFs offline to avoid duplicate IP conflicts
Example:
node_03 VifMgr fails to join quorum using cluster LIF 1 on e0a
[kern_vifmgr:info:9017] A [src/rdb/TM.cc 1621 (0x80ea38600)]: _triggerOnlineStatusCallback: TM 1002: Report UNIT_IS_OFFLINE (epoch 0, master 0). Reason: RW_TXN txn could not acquire transaction: RPC failure ().
[kern_vifmgr:info:9017] A [src/rdb/TM.cc 1625 (0x80ea38600)]: _triggerOnlineStatusCallback: FAILOVER rdb: Local unit VifMgr offline
node_03 VifMgr attempts to move cluster LIF 1 to another port, but fails because it is out of quorum (OOQ)
[kern_vifmgr:info:9017] [0x812356d00] [Net::CdbLifHandle::avoidDownPorts] LIF lif:cdb:node_03:node_03_clus1 (1000) is assigned to a down port (node_03:e0a). Attempting to reassign.
[kern_vifmgr:info:9017] Warning: Unable to list entries on node node_04. RPC: Port mapper failure [from vifmgr on node "node_03" (VSID: -3) to mgwd at 169.254.249.59]
node_04 VifMgr loses quorum because it fails to communicate with node_03
[kern_vifmgr:info:9156] A [src/rdb/cluster_events.cc 88 (0x80e836c00)]: Report: Cluster event: cluster-quorum-ends, epoch 31, site 1003 [not enough healthy nodes (1/2 healthy)].
[kern_vifmgr:info:9156] A [src/rdb/quorum/qm_states/inq/HoldingQuorumState.cc 55 (0x80e836c00)]: doWork: Master losing quorum, not enough votes to maintain quorum at 2248s.
node_04 does not regain quorum within the 65-second grace period and takes offline any LIFs that could be hosted on node_03 to avoid a split-brain/duplicate IP scenario
[kern_vifmgr:info:9156] [0x80ae37300] [EventMgr::unitOffline] Setting VifMgr operational status as OOQ
[kern_vifmgr:info:9156] [0x80ae37300] [FailoverMgr::localNodeDown] VifMgr on node node_04 is now out of quorum.
[node_04: vifmgr: vifmgr.lifBeingRemoved:notice]: LIF data_01 (on virtual server 7), IP address 1.11.20.12, is being removed from node node_04, port a0a-120.
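The sequence above can be traced in collected log text by searching for the key strings in each excerpt. The following is a minimal Python sketch, not a NetApp tool. It assumes the log lines have already been gathered into a list of strings. The marker substrings are copied from the excerpts above; the stage labels, the function names (scan, removed_lifs), and the field layout of the vifmgr.lifBeingRemoved message are assumptions based only on the single sample line shown here.

```python
import re

# Substrings copied from the log excerpts in this article.
# The stage labels are descriptive only and are not ONTAP terminology.
MARKERS = [
    ("UNIT_IS_OFFLINE", "local VifMgr unit reported offline"),
    ("cluster-quorum-ends", "cluster quorum ended"),
    ("Setting VifMgr operational status as OOQ", "VifMgr marked out of quorum"),
    ("vifmgr.lifBeingRemoved", "LIF being removed from a port"),
]

# Field layout assumed from the one vifmgr.lifBeingRemoved sample above.
LIF_REMOVED = re.compile(
    r"LIF (?P<lif>\S+) \(on virtual server (?P<vserver_id>\d+)\), "
    r"IP address (?P<ip>\S+), is being removed from node (?P<node>\S+), "
    r"port (?P<port>[^\s.]+)"
)


def scan(lines):
    """Return (line_number, stage) pairs for lines containing a known marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for needle, stage in MARKERS:
            if needle in line:
                hits.append((number, stage))
                break  # report at most one stage per line
    return hits


def removed_lifs(lines):
    """Return one dict per LIF-removal message found in the lines."""
    found = []
    for line in lines:
        match = LIF_REMOVED.search(line)
        if match:
            found.append(match.groupdict())
    return found
```

Calling scan() on the collected lines gives the order in which the offline, quorum-loss, and OOQ events appear, and removed_lifs() lists which data LIFs were taken off which node and port. Lines that match none of the markers are ignored, so the absence of a hit does not rule out this issue.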