
ONTAP Select: Takeover is not possible and HA Interconnect RDMA down

Applies to

  • NetApp ONTAP Select
  • HA-Interconnect (IC)
  • Storage Failover Takeover

Issue

  • ONTAP HA shows that takeover is not possible on both nodes:
::*> storage failover show
                                Takeover
Node            Partner         Possible State Description
--------------  --------------  -------- -------------------------------------
ontap-select-01 ontap-select-02 false    Waiting for ontap-select-02,
                                         Takeover is not possible: NVRAM log
                                         not synchronized
ontap-select-02 ontap-select-01 false    Waiting for ontap-select-01,
                                         Takeover is not possible: NVRAM log
                                         not synchronized, Disk inventory not
                                         exchanged
2 entries were displayed.
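
    Note: For the complete failover state of each node behind the summary columns above, the detailed view of the same command can be used; -instance is the standard ONTAP CLI switch that expands every field of a show command:

::*> storage failover show -instance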
 
  • The HA Interconnect Link can show either up or down, but the IC RDMA connection is down:
::> set adv
Warning: These advanced commands are potentially dangerous; use them only when directed to do so by NetApp personnel.
Do you want to continue? {y|n}: y

::*> node run -node * -command ic status
2 entries were acted on.
Node: ontap-select-01
Link : up
IC RDMA connection : down
Node: ontap-select-02
Link : down
IC RDMA connection : down
 
  • Even when Link shows down, the vNIC status in VMware shows Connected; if it does not, connect the vNIC in VMware:

    Note: Refer to the Solution section for how to identify the correct vNIC MAC address; a command-line cross-check is sketched below the screenshot.

    [Screenshot: VMware virtual machine settings showing the interconnect vNIC status as Connected]
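
    Note: As a cross-check from the ESXi side, the interconnect vNIC MAC address (seen later in the vmkernel log as 02:0c:00:00:80:f2 for eth5) can be searched for in the virtual machine's vmware.log. This is a minimal sketch; the datastore path below is an assumption for illustration:

# ESXi host shell; the datastore/VM directory is hypothetical
grep -i "02:0c:00:00:80:f2" /vmfs/volumes/datastore1/ontap-select-01/vmware.log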

  • The event log shows the following sequence of events when the problem triggers:

    Note: Some events, such as those with severity debug, are not visible at the admin privilege level and might require an elevated privilege level (see the filtering sketch after the log excerpt).
 
::*> event log show
Sat May 27 2023 17:00:26 +00:00 [ontap-select-02:cf.ic.xferTimedOutVSA:notice]: HA interconnect: ofw transfer timed out.
Sat May 27 2023 17:00:26 +00:00 [ontap-select-02:cf.fm.partnerFwTransition:info]: prevstate="SF_UP", newstate="SF_UNKNOWN", progresscounter="0"
Sat May 27 2023 17:00:28 +00:00 [ontap-select-02:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of ontap-select-01 disabled (unsynchronized log).
Sat May 27 2023 17:00:29 +00:00 [ontap-select-02:ic.rdma.qpDisconnected:debug]: ofw is disconnected.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:cf.ic.xferTimedOutVSA:notice]: HA interconnect: wafl transfer timed out.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_ONLINE is aborted because of reason Abort Pending.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:sk.hog.runtime:notice]: Process wafl_exempt01 ran for 16048 milliseconds
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:mgr.stack.longrun.proc:notice]: Long running process: wafl_exempt01
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:mgr.stack.frame:notice]: Stack frame  0: maytag.ko::sk_save_stackframes(0xffffffff8942f6f0) + 0x30
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:ha.healthCheckRoundtrip:debug]: HA_HEALTH_CHECK request-id 7 start-timestamp 8294259962 round-trip time: 0 msecs.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:ha.netPartition.other:debug]: Network partition due to other error. Duration 119 msecs, takeover wait 0 msecs; error code 5; status: 0x1001; request id: 7.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20230527_17:00:31:741638.dmp.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of ontap-select-02 by ontap-select-01 disabled (unsynchronized log).
Sat May 27 2023 17:00:35 +00:00 [ontap-select-02:rastrace.dump.saved:debug]: A RAS trace dump for module HA instance 0 was stored in /etc/log/rastrace/HA_0_20230527_17:00:35:668467.dmp.
Sat May 27 2023 17:00:51 +00:00 [ontap-select-02:nvmm.mirror.offlined:debug]: mirror="HA Partner Mirror Offlined"
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:rdma.rlib.queue.full:notice]: Send queue of QP Control is full.
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:ctrl.rdma.heartBeat:info]: HA interconnect: Missed heartbeat to 169.254.128.242.
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:sk.hog.runtime:notice]: Process ctrl_hb_port_e0f ran for 16051 milliseconds
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:mgr.stack.longrun.proc:notice]: Long running process: ctrl_hb_port_e0f

Sat May 27 2023 17:01:00 +00:00 [ontap-select-02:monitor.globalStatus.critical:EMERGENCY]: Controller failover of ontap-select-01 is not possible: unsynchronized log.
Sat May 27 2023 17:01:33 +00:00 [ontap-select-02:cf.diskinventory.sendFailed:debug]: reason="HA Interconnect down", errorCode="0"
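
    Note: To pull only the interconnect-related messages out of the event log, the -message-name parameter of event log show can be used. This is a sketch; wildcard matching is assumed to behave as in other ONTAP show-command filters, and the diagnostic privilege level is assumed to be required for the debug-severity entries:

::*> set diag
::*> event log show -message-name cf.ic.*
::*> event log show -message-name ic.rdma.*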
  • The ESXi vmkernel log at the same time shows a vmxnet3 "Hang detected" condition on the interconnect vNIC (a search sketch follows the excerpt):
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21226: ontap-select-01.eth5,02:0c:00:00:80:f2, portID(67108922): Hang detected,numHangQ: 1, enableGen: 96
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)NetSched: 752: 0x8400000f: received a force quiesce for port 0x400003a, dropped 9 pkts
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1542 eop: 1543 enableGen: 0 qid: 96, pkt: 0x45c995a9b900
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1540 eop: 1541 enableGen: 0 qid: 96, pkt: 0x45c98885c900
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1538 eop: 1539 enableGen: 0 qid: 96, pkt: 0x45c988887980
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1536 eop: 1537 enableGen: 0 qid: 96, pkt: 0x45c9940c4f80
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1534 eop: 1535 enableGen: 0 qid: 96, pkt: 0x45c98bd48d40
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1532 eop: 1533 enableGen: 0 qid: 96, pkt: 0x45c995b63d00
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1530 eop: 1531 enableGen: 0 qid: 96, pkt: 0x45c995ba5a00
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1528 eop: 1529 enableGen: 0 qid: 96, pkt: 0x45c99413bf00
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21235: portID:67108922, QID: 0, next2TX: 1496, next2Comp: 1528, lastNext2TX: 1496, next2Write:3253, ringSize: 4096 inFlight: 18, delay(ms): 4622,txStopped: 0
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21226: ontap-select-01.eth5,02:0c:00:00:80:f2, portID(67108922): Hang detected,numHangQ: 1, enableGen: 96
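
    Note: To check whether the vmxnet3 transmit hang recurs on the host, the live vmkernel log can be searched directly from the ESXi shell. A minimal sketch; rotated logs (vmkernel.*.gz), if present, may hold earlier occurrences:

# ESXi host shell
grep "Hang detected" /var/log/vmkernel.log
zcat /var/log/vmkernel.*.gz 2>/dev/null | grep "Hang detected"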

 

