Unsynchronized logs due to high-availability interconnect log transfer "Send queue of QP wafl is full" condition
Applies to
- AFF8040, AFF A300, AFF A220, AFF A200, AFF C190
- FAS8200, FAS2750, FAS2720, FAS2650, FAS2620
- ONTAP 9
- Cloud Volumes ONTAP (CVO)
- Amazon FSx for NetApp ONTAP
Issue
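- High CPU utilization is driven by the WAFL_EXEMPT domain.
- The following messages are displayed in the EMS event logs, showing the HA interconnect send queue filling up, the NVRAM mirror going offline, and takeover being disabled until the mirror resynchronizes: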
Sun Nov 01 02:14:22 KST [node: wafl_exempt10: rdma.rlib.queue.full:notice]: Send queue of QP wafl is full.
Sun Nov 01 02:14:22 KST [node: wafl_exempt10: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_ONLINE is aborted because of reason Abort Pending.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: ems.engine.suppressed:debug]: Event 'ic.rdma.qpDisconnected' suppressed 18 times in last 33933 seconds.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: ic.rdma.qpDisconnected:debug]: wafl is disconnected.
Sun Nov 01 02:14:22 KST [node: nvram_sync: nvmm.mirror.offlined:debug]: params: {'mirror': 'HA Partner Mirror Offlined'}
Sun Nov 01 02:14:22 KST [node: rastrace_dump: rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20201101_02:14:22:671353.dmp.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: ems.engine.suppressed:debug]: Event 'ic.rdma.qpConnected' suppressed 18 times in last 33933 seconds.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: ic.rdma.qpConnected:debug]: wafl is connected.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: rdma.rlib.connected:debug]: wafl QP is now connected.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: rdma.rlib.connected:debug]: raid QP is now connected.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: rdma.rlib.connected:debug]: misc QP is now connected.
Sun Nov 01 02:14:23 KST [node: cf_main: cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of node disabled (unsynchronized log).
Sun Nov 01 02:14:23 KST [node: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node by node disabled (unsynchronized log).
Sun Nov 01 02:14:31 KST [node: ctlg_flxlg_mirror: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_SYNCING_OTHER is aborted because of reason Abort Pending.
Sun Nov 01 02:14:34 KST [node: wafl_exempt14: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_SYNCING_CP1_START is aborted because of reason Abort Pending.
Sun Nov 01 02:14:56 KST [node: nvram_sync: nvmm.mirror.onlined:debug]: params: {'mirror': 'HA Partner Mirror Onlined'}
Sun Nov 01 02:14:57 KST [node: cf_main: cf.fsm.takeoverByPartnerEnabled:notice]: Failover monitor: takeover of node by node enabled
Sun Nov 01 02:15:00 KST [node: monitor: monitor.globalStatus.critical:EMERGENCY]: Controller failover of node is not possible: unsynchronized log.
Sun Nov 01 02:15:01 KST [node: cf_main: cf.fsm.takeoverOfPartnerEnabled:notice]: Failover monitor: takeover of node enabled
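The sequence in the excerpt above can be checked programmatically when reviewing a longer EMS log. The following is a minimal Python sketch, not a NetApp tool: it assumes the line layout shown in the excerpt (timestamp, time zone, then `[node: process: event:severity]: message`). The function names `parse_ems_line` and `takeover_disabled_windows` are hypothetical, and the year is an assumption because the EMS lines do not carry one (2020 is inferred from the RAS trace file name in the excerpt). It reports how long takeover stayed disabled in each direction after a `rdma.rlib.queue.full` event.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt:
# "<Day Mon DD HH:MM:SS> <TZ> [<node>: <process>: <event>:<severity>]: <message>"
EMS_LINE = re.compile(
    r"^(?P<ts>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\S+) "
    r"\[(?P<node>[^:]+): (?P<proc>[^:]+): (?P<event>[\w.]+):(?P<sev>\w+)\]: "
    r"(?P<msg>.*)$"
)


def parse_ems_line(line, year=2020):
    """Parse one EMS line into a dict, or return None if it does not match.

    The year is an assumption supplied by the caller; the log lines omit it.
    """
    match = EMS_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(
        f"{year} {record['ts']}", "%Y %a %b %d %H:%M:%S"
    )
    return record


def takeover_disabled_windows(lines, year=2020):
    """Return (queue_full_seen, windows).

    windows maps an event prefix such as "cf.fsm.takeoverOfPartner" to the
    number of seconds between its *Disabled event and the next *Enabled event.
    """
    queue_full_seen = False
    started = {}
    windows = {}
    for line in lines:
        record = parse_ems_line(line, year)
        if record is None:
            continue
        event = record["event"]
        if event == "rdma.rlib.queue.full":
            queue_full_seen = True
        elif event.startswith("cf.fsm.takeover") and event.endswith("Disabled"):
            # Remember only the first Disabled event of each direction.
            started.setdefault(event[: -len("Disabled")], record["time"])
        elif event.startswith("cf.fsm.takeover") and event.endswith("Enabled"):
            key = event[: -len("Enabled")]
            if key in started:
                delta = record["time"] - started.pop(key)
                windows[key] = delta.total_seconds()
    return queue_full_seen, windows


# Lines copied from the excerpt above.
SAMPLE = [
    "Sun Nov 01 02:14:22 KST [node: wafl_exempt10: rdma.rlib.queue.full:notice]: Send queue of QP wafl is full.",
    "Sun Nov 01 02:14:23 KST [node: cf_main: cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of node disabled (unsynchronized log).",
    "Sun Nov 01 02:14:23 KST [node: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node by node disabled (unsynchronized log).",
    "Sun Nov 01 02:14:57 KST [node: cf_main: cf.fsm.takeoverByPartnerEnabled:notice]: Failover monitor: takeover of node by node enabled",
    "Sun Nov 01 02:15:01 KST [node: cf_main: cf.fsm.takeoverOfPartnerEnabled:notice]: Failover monitor: takeover of node enabled",
]

if __name__ == "__main__":
    seen, windows = takeover_disabled_windows(SAMPLE)
    print("queue full seen:", seen)
    for direction, seconds in windows.items():
        print(f"{direction}: disabled for {seconds:.0f} s")
```

Run against the sample lines, the sketch reports that takeover by the partner was disabled for 34 seconds and takeover of the partner for 38 seconds, which matches the timestamps in the excerpt. Lines that do not match the assumed layout are skipped rather than raising an error, so the same function can be pointed at a full EMS log export.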
Note: This article does not apply to MetroCluster systems that use an FCVI interconnect.