WQE failure in EMS with extended status 0x16 frequently occurring
Applies to
- NetApp FAS
- NetApp AFF
- Data ONTAP
Issue
WQE Failure
Error withExt_Status 0x16 , Ext_Status 0x1d
andExt_Status 0x2
logged on storage port underEMS
log in ASUP across all the Nodes in the cluster:
fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2
-
Ext_Status 0x16
means that the Host initiator has sent an abort to clear the current command queue. This does not necessarily identify the issue or root cause but a symptom / side effect. Ext_Status 0x1d
identifies that there is out of order frame delivery.
- Active IQ Unified Manager Event:
No Active Paths to Access LUN
The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
- No link related issues observed on any of the NetApp ports which are being used by the affected Igroup.
- Performance issue combined with multiple entries in the EMS log.
- VMS are in not responding state or hung state and end users unable to access applications.
FRAME DROP event has been observed
in the vmkernel logs
- Occurs on validated Interoperability Matrix Tool configuration.
- Physical layer issues identified on the physical SAN switches attached to the port in the EMS log
- A region of the fabric or Host initiator with errors and performance during the timestamp has been identified
-
Credit loss events
reported for host connected ports undershow logging log a
s well asshow process creditmon credit-loss-events
. -
Ciscoswitch# show logging log
:
2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .
Ciscoswitch# show process creditmon credit-loss-events
Module: 01 Credit Loss Events: YES
----------------------------------------------------
| Interface | Total | Timestamp |
| | Events | |
----------------------------------------------------
| fc1/4 | 526 | 1. Fri Sep 29 12:21:34 2023 |
| | | 2. Fri Sep 29 11:21:30 2023 |
| | | 3. Fri Sep 29 09:46:23 2023 |
- Overutilization is observed on multiple F ports connected to the host.
Congestion Summary...
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | Tx Congestion problems |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Level 3 | Level 2 | Level 1.5 TxWait >= 30% | Level 1 TxWait < 30% | Tx Util >= 80% |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other |
| Switchname | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports |
+ ------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1 | No | Yes | Yes | No | Yes | Yes | No | Yes(63%) | No | No | Yes(29%) | No | No | Yes(86%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2 | No | Yes | Yes | No | Yes | Yes | No | Yes(61%) | No | No | Yes(29%) | No | No | Yes(89%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+--
-
TXwait congestion
reported on the switches on multiple port interfaces connected to the hosts.
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 1.5 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------O B F L T x W a i t Congestion >= 30%-------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc10/46 | TxWait Congestion 41% | 8sec | 3303328 | 2023/09/29 11:41:15 |
| fc10/46 | TxWait Congestion 40% | 8sec | 3226326 | 2023/09/29 11:41:35 |
| fc10/46 | TxWait Congestion 39% | 7sec | 3177619 | 2023/09/29 11:41:55 |
| fc10/46 | TxWait Congestion 32% | 6sec | 2565321 | 2023/09/29 15:32:56 |
TxWait
means that MDS has to wait to transmit the frames because it is not receiving theR_RDY
signals from the connected end device and had to check the end device as they are not sending theR_RDY
signals to MDS.TxWait
increments each time when local switch waiting for r_ready (B2B credit) for 2.5 microseconds. This counters collected every 20 seconds and 30% or more in 20 seconds is consider not good. In our case we don't want to through alert every 20 seconds and it is not mature period to decide link is slow so we setup our port-monitor for 70% in 300 seconds (5minute) which is more genuine alert you can report to service provider.Slowdrain
logs saysLR successful ,
this clearly means that the end device / HBA / Driver / Firmware is the issue and end device needs to be investigated.
Switchname: hyd02-mds-switch1
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 3 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------C r e d i t L o s s R e c o v e r y -------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 233 | 1 | 2023/09/29 04:02:11 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30167 | 128 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 232 | 1 | 2023/09/29 04:00:30 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30039 | 257 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 231 | 1 | 2023/09/29 03:40:07 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 29782 | 166 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 230 | 1 | 2023/09/29 03:38:06 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP