ONTAP Select: Takeover is not possible and HA Interconnect RDMA down
Applies to
- NetApp ONTAP Select
- HA-Interconnect (IC)
- Storage Failover Takeover
Issue
- Storage failover reports that takeover is not possible:
::*> storage failover show
                                Takeover
Node            Partner         Possible State Description
--------------- --------------- -------- -------------------------------------
ontap-select-01 ontap-select-02 false    Waiting for ontap-select-02,
                                         Takeover is not possible: NVRAM log
                                         not synchronized
ontap-select-02 ontap-select-01 false    Waiting for ontap-select-01,
                                         Takeover is not possible: NVRAM log
                                         not synchronized, Disk inventory not
                                         exchanged
2 entries were displayed.
- HA Interconnect Link is either up or down, but IC RDMA connection is down:
::> set adv
Warning: These advanced commands are potentially dangerous; use them only when directed to do so by NetApp personnel.
Do you want to continue? {y|n}: y
::*> node run -node * -command ic status
2 entries were acted on.
Node: ontap-select-01
Link : up
IC RDMA connection : down
Node: ontap-select-02
Link : down
IC RDMA connection : down
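The check above can be scripted when several nodes have to be reviewed. The following is a minimal sketch, assuming the `ic status` output has been captured as plain text in the layout shown above; the function name `parse_ic_status` and the sample variable are hypothetical and not part of ONTAP.

```python
import re

# Sample text copied from the `ic status` output shown above.
IC_STATUS_SAMPLE = """\
Node: ontap-select-01
Link : up
IC RDMA connection : down
Node: ontap-select-02
Link : down
IC RDMA connection : down
"""

def parse_ic_status(text):
    """Map each node name to its 'Link' and 'IC RDMA connection' values."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        node = re.match(r"^Node:\s*(\S+)$", line)
        if node:
            current = node.group(1)
            nodes[current] = {}
            continue
        field = re.match(r"^(Link|IC RDMA connection)\s*:\s*(\S+)$", line)
        if field and current is not None:
            nodes[current][field.group(1)] = field.group(2)
    return nodes

def rdma_down_nodes(nodes):
    """Return the sorted names of nodes whose IC RDMA connection is down."""
    return sorted(
        name for name, status in nodes.items()
        if status.get("IC RDMA connection") == "down"
    )

if __name__ == "__main__":
    parsed = parse_ic_status(IC_STATUS_SAMPLE)
    for name in rdma_down_nodes(parsed):
        print(name, "link:", parsed[name].get("Link"), "RDMA: down")
```

The sketch only reports what the command already printed; it does not query the cluster.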
- When Link shows down, the vNIC status in VMware shows connected
  - If it does not, connect the vNIC in VMware.
  Note: Refer to the Solution section to identify the correct vNIC MAC address.
- The event log shows the following sequence of events when the problem occurs:
Note: Some events, such as those with severity debug, are not shown at the admin privilege level and might require a higher privilege level.
::*> event log show
Sat May 27 2023 17:00:26 +00:00 [ontap-select-02:cf.ic.xferTimedOutVSA:notice]: HA interconnect: ofw transfer timed out.
Sat May 27 2023 17:00:26 +00:00 [ontap-select-02:cf.fm.partnerFwTransition:info]: prevstate="SF_UP", newstate="SF_UNKNOWN", progresscounter="0"
Sat May 27 2023 17:00:28 +00:00 [ontap-select-02:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of ontap-select-01 disabled (unsynchronized log).
Sat May 27 2023 17:00:29 +00:00 [ontap-select-02:ic.rdma.qpDisconnected:debug]: ofw is disconnected.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:cf.ic.xferTimedOutVSA:notice]: HA interconnect: wafl transfer timed out.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_ONLINE is aborted because of reason Abort Pending.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:sk.hog.runtime:notice]: Process wafl_exempt01 ran for 16048 milliseconds
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:mgr.stack.longrun.proc:notice]: Long running process: wafl_exempt01
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:mgr.stack.frame:notice]: Stack frame 0: maytag.ko::sk_save_stackframes(0xffffffff8942f6f0) + 0x30
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:ha.healthCheckRoundtrip:debug]: HA_HEALTH_CHECK request-id 7 start-timestamp 8294259962 round-trip time: 0 msecs.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:ha.netPartition.other:debug]: Network partition due to other error. Duration 119 msecs, takeover wait 0 msecs; error code 5; status: 0x1001; request id: 7.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20230527_17:00:31:741638.dmp.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of ontap-select-02 by ontap-select-01 disabled (unsynchronized log).
Sat May 27 2023 17:00:35 +00:00 [ontap-select-02:rastrace.dump.saved:debug]: A RAS trace dump for module HA instance 0 was stored in /etc/log/rastrace/HA_0_20230527_17:00:35:668467.dmp.
Sat May 27 2023 17:00:51 +00:00 [ontap-select-02:nvmm.mirror.offlined:debug]: mirror="HA Partner Mirror Offlined"
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:rdma.rlib.queue.full:notice]: Send queue of QP Control is full.
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:ctrl.rdma.heartBeat:info]: HA interconnect: Missed heartbeat to 169.254.128.242.
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:sk.hog.runtime:notice]: Process ctrl_hb_port_e0f ran for 16051 milliseconds
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:mgr.stack.longrun.proc:notice]: Long running process: ctrl_hb_port_e0f
Sat May 27 2023 17:01:00 +00:00 [ontap-select-02:monitor.globalStatus.critical:EMERGENCY]: Controller failover of ontap-select-01 is not possible: unsynchronized log.
Sat May 27 2023 17:01:33 +00:00 [ontap-select-02:cf.diskinventory.sendFailed:debug]: reason="HA Interconnect down", errorCode="0"
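To pick the interconnect-related entries out of a longer event log, the lines can be filtered by event name. The following is a minimal sketch, assuming the log was saved as text in the format shown above. The prefix list in `IC_PREFIXES` is an illustrative selection taken from the events in this article, not an official list, and the function names are hypothetical.

```python
import re

# Matches: "<timestamp> [<node>:<event>:<severity>]: <message>"
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{2}:\d{2}) "
    r"\[(?P<node>[^:\]]+):(?P<event>[^:\]]+):(?P<severity>[^\]]+)\]: "
    r"(?P<message>.*)$"
)

# Assumption: event-name prefixes chosen from the sample log in this article.
IC_PREFIXES = ("cf.ic.", "ic.rdma.", "ctrl.rdma.", "rdma.rlib.")

EVENT_SAMPLE = """\
Sat May 27 2023 17:00:26 +00:00 [ontap-select-02:cf.ic.xferTimedOutVSA:notice]: HA interconnect: ofw transfer timed out.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:sk.hog.runtime:notice]: Process wafl_exempt01 ran for 16048 milliseconds
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:ctrl.rdma.heartBeat:info]: HA interconnect: Missed heartbeat to 169.254.128.242.
"""

def parse_events(text):
    """Parse event log lines into dictionaries; unparseable lines are skipped."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events

def interconnect_events(events, prefixes=IC_PREFIXES):
    """Keep only events whose name starts with one of the given prefixes."""
    return [e for e in events if e["event"].startswith(prefixes)]

if __name__ == "__main__":
    for e in interconnect_events(parse_events(EVENT_SAMPLE)):
        print(e["ts"], e["node"], e["event"], e["severity"])
```

Adjust the prefixes to the events of interest; the filter makes no judgment about root cause.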
- The ESXi vmkernel log shows a Vmxnet3 hang on an ONTAP Select vNIC at the same time:
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21226: ontap-select-01.eth5,02:0c:00:00:80:f2, portID(67108922): Hang detected,numHangQ: 1, enableGen: 96
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)NetSched: 752: 0x8400000f: received a force quiesce for port 0x400003a, dropped 9 pkts
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1542 eop: 1543 enableGen: 0 qid: 96, pkt: 0x45c995a9b900
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1540 eop: 1541 enableGen: 0 qid: 96, pkt: 0x45c98885c900
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1538 eop: 1539 enableGen: 0 qid: 96, pkt: 0x45c988887980
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1536 eop: 1537 enableGen: 0 qid: 96, pkt: 0x45c9940c4f80
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1534 eop: 1535 enableGen: 0 qid: 96, pkt: 0x45c98bd48d40
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1532 eop: 1533 enableGen: 0 qid: 96, pkt: 0x45c995b63d00
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1530 eop: 1531 enableGen: 0 qid: 96, pkt: 0x45c995ba5a00
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1528 eop: 1529 enableGen: 0 qid: 96, pkt: 0x45c99413bf00
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21235: portID:67108922, QID: 0, next2TX: 1496, next2Comp: 1528, lastNext2TX: 1496, next2Write:3253, ringSize: 4096 inFlight: 18, delay(ms): 4622,txStopped: 0
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21226: ontap-select-01.eth5,02:0c:00:00:80:f2, portID(67108922): Hang detected,numHangQ: 1, enableGen: 96
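The "Hang detected" lines carry the VM name, guest interface, MAC address, and port ID of the affected vNIC, which helps when matching the log against the vNIC in VMware. The following is a minimal sketch, assuming the vmkernel log was exported as text and that hang lines follow the format shown above; the function name `find_vmxnet3_hangs` is hypothetical.

```python
import re
from collections import Counter

# Matches the "<vm>.<ethN>,<mac>, portID(<id>): Hang detected" part of a line.
HANG_RE = re.compile(
    r"Vmxnet3: \d+: (?P<vm>.+?)\.(?P<nic>eth\d+),"
    r"(?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}), "
    r"portID\((?P<port_id>\d+)\): Hang detected"
)

VMKERNEL_SAMPLE = """\
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21226: ontap-select-01.eth5,02:0c:00:00:80:f2, portID(67108922): Hang detected,numHangQ: 1, enableGen: 96
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)NetSched: 752: 0x8400000f: received a force quiesce for port 0x400003a, dropped 9 pkts
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21226: ontap-select-01.eth5,02:0c:00:00:80:f2, portID(67108922): Hang detected,numHangQ: 1, enableGen: 96
"""

def find_vmxnet3_hangs(text):
    """Count hang messages per (vm, nic, mac, port_id) tuple."""
    hangs = Counter()
    for line in text.splitlines():
        match = HANG_RE.search(line)
        if match:
            key = (
                match.group("vm"),
                match.group("nic"),
                match.group("mac").lower(),
                match.group("port_id"),
            )
            hangs[key] += 1
    return hangs

if __name__ == "__main__":
    for (vm, nic, mac, port_id), count in find_vmxnet3_hangs(VMKERNEL_SAMPLE).items():
        print(f"{vm} {nic} mac={mac} portID={port_id} hangs={count}")
```

The MAC address extracted here can be compared with the vNIC MAC address identified in the Solution section.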