
WQE failure in EMS with extended status 0x16 frequently occurring

Applies to

  • NetApp FAS
  • NetApp AFF
  • Data ONTAP

Issue

  • WQE failure errors with Ext_Status 0x16, Ext_Status 0x1d, and Ext_Status 0x2 are logged against storage ports in the EMS log (visible in AutoSupport) across all nodes in the cluster:

fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16

fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d

fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2
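
To gauge how often these events are being logged, the EMS event log can be queried from the ONTAP CLI (a minimal sketch; fcp.io.status is a debug-severity message, so depending on the ONTAP release and severity filtering it may only be visible in the EMS log files bundled with AutoSupport):

cluster1::> event log show -message-name fcp.io.status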

  • Ext_Status 0x16 means that the host initiator has sent an abort to clear the current command queue. This is a symptom or side effect rather than the issue or root cause itself.
  • Ext_Status 0x1d indicates out-of-order frame delivery.
  • Active IQ Unified Manager Event: No Active Paths to Access LUN
    The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
  • No link-related issues are observed on any of the NetApp ports used by the affected igroup.
  • Performance issues occur in combination with multiple entries in the EMS log.
  • VMs are in a not-responding or hung state, and end users are unable to access applications.
  • FRAME DROP events are observed in the vmkernel logs, as in the screenshot and example below.

[Screenshot: Frame_drop.png showing FRAME DROP events in the vmkernel log]
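
To confirm the drops from the host side, the vmkernel log can be searched directly on the ESXi host (a minimal sketch; the path is the ESXi default, and the exact message text varies by HBA driver and ESXi release, so the search pattern is an assumption):

[root@esxi-host:~] grep -i "frame drop" /var/log/vmkernel.log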

  • Occurs on a configuration validated by the Interoperability Matrix Tool.
  • Physical-layer issues are identified on the SAN switch ports attached to the ports reported in the EMS log.
  • A region of the fabric, or a host initiator, showing errors and degraded performance during the relevant timestamps has been identified.
  • Credit-loss events are reported for host-connected ports under show logging log as well as show process creditmon credit-loss-events.

  • Ciscoswitch# show logging log:

2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .

  • Ciscoswitch# show process creditmon credit-loss-events

        Module: 01      Credit Loss Events: YES

----------------------------------------------------
| Interface |  Total |          Timestamp          |
|           | Events |                             |
----------------------------------------------------
| fc1/4     |    526 | 1. Fri Sep 29 12:21:34 2023 |
|           |        | 2. Fri Sep 29 11:21:30 2023 |
|           |        | 3. Fri Sep 29 09:46:23 2023 |
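
To corroborate the creditmon events, the cumulative credit-loss and timeout-discard counters for the suspect interface can be checked on the MDS (a sketch; the interface is taken from the logs above, and the counter names in the output vary slightly by platform and NX-OS release):

Ciscoswitch# show interface fc1/4 counters detailed

Look for the credit-loss, timeout-discard, and TxWait counters in the output, and verify whether they are still incrementing across repeated runs of the command.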


 

  • Overutilization is observed on multiple F-ports connected to the hosts:

Congestion Summary...
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                   |                                                           Tx Congestion problems                                                                       |
|                   +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
|                   |          Level 3         |          Level 2         | Level 1.5 TxWait >= 30%        | Level 1 TxWait < 30%           | Tx Util >= 80%                 |
|                   +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
|                   | Mode E | Mode F | Other  | Mode E | Mode F | Other  | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    |
| Switchname        | Ports  | Ports  | Ports  | Ports  | Ports  | Ports  | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1           | No     | Yes    | Yes    | No     | Yes    | Yes    | No       | Yes(63%) | No       | No       | Yes(29%) | No       | No       | Yes(86%) | No       |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2           | No     | Yes    | Yes    | No     | Yes    | Yes    | No       | Yes(61%) | No       | No       | Yes(29%) | No       | No       | Yes(89%) | No       |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
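
Summary tables like the one above are typically produced by offline analysis of slow-drain data collected from the fabric; on MDS switches, the underlying data can be gathered with the following command (a sketch; availability depends on the NX-OS release):

Ciscoswitch# show tech-support slowdrain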

 

  • TxWait congestion is reported on the switches on multiple port interfaces connected to the hosts.

--------------------------------------------------------------------------------------------------------------
| Slowdrain level 1.5 problems found: Yes                                                                    |
--------------------------------------------------------------------------------------------------------------
 -------------------------------------------O B F L  T x W a i t  Congestion >= 30%-------------------------------------------
| Interface  | Counter                                                 | Count        | Delta      | Timestamp               |
 ----------------------------------------------------------------------------------------------------------------------------
| fc10/46    | TxWait Congestion 41%                                   |         8sec |    3303328 | 2023/09/29 11:41:15     |
| fc10/46    | TxWait Congestion 40%                                   |         8sec |    3226326 | 2023/09/29 11:41:35     |
| fc10/46    | TxWait Congestion 39%                                   |         7sec |    3177619 | 2023/09/29 11:41:55     |
| fc10/46    | TxWait Congestion 32%                                   |         6sec |    2565321 | 2023/09/29 15:32:56     |
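
The per-port TxWait history shown above is stored in the onboard failure log (OBFL) and can be read directly on the switch (a sketch; command availability and output format depend on the MDS platform and NX-OS release):

Ciscoswitch# show logging onboard txwait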

 

 

  • TxWait means that the MDS has to wait to transmit frames because it is not receiving R_RDY signals from the connected end device; since the end device is not returning R_RDY signals to the MDS, the end device needs to be checked.
  • TxWait increments each time the local switch waits 2.5 microseconds for an R_RDY (B2B credit). The counter is collected every 20 seconds, and a value of 30% or more within a 20-second window is considered problematic (a 20-second window contains 8,000,000 of the 2.5-microsecond intervals, so 30% corresponds to roughly 2.4 million increments). Because alerting every 20 seconds is noisy, and 20 seconds is too short a period to decide that a link is slow, the port-monitor in this case is configured for 70% over 300 seconds (5 minutes), which yields a more credible alert to report to the service provider; see the configuration sketch after the output below.
  • The slow-drain logs report "LR Successful", which clearly indicates that the end device (HBA, driver, or firmware) is the issue and that the end device needs to be investigated.

Switchname: hyd02-mds-switch1
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 3 problems found: Yes                                                                      |
--------------------------------------------------------------------------------------------------------------
 -------------------------------------------C r e d i t  L o s s  R e c o v e r y-------------------------------------------
| Interface  | Counter                                                 | Count        | Delta      | Timestamp               |
 ----------------------------------------------------------------------------------------------------------------------------
| fc2/4      | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)             |          233 |          1 | 2023/09/29 04:02:11     |
| fc2/4      | F32_TMM_PORT_TIMEOUT_DROP                               |        30167 |        128 |                         |
| fc2/4      | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)             |          232 |          1 | 2023/09/29 04:00:30     |
| fc2/4      | F32_TMM_PORT_TIMEOUT_DROP                               |        30039 |        257 |                         |
| fc2/4      | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)             |          231 |          1 | 2023/09/29 03:40:07     |
| fc2/4      | F32_TMM_PORT_TIMEOUT_DROP                               |        29782 |        166 |                         |
| fc2/4      | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)             |          230 |          1 | 2023/09/29 03:38:06     |
| fc2/4      | F32_TMM_PORT_TIMEOUT_DROP
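
As referenced above, a port-monitor policy can be set up so that TxWait alerts fire only when congestion is sustained (a minimal sketch under stated assumptions: the policy name TxWait-5min is hypothetical, event 4 is an example syslog event ID, and the exact counter syntax varies by NX-OS release, so verify it against the port-monitor configuration guide for your release):

Ciscoswitch(config)# port-monitor name TxWait-5min
Ciscoswitch(config-port-monitor)# counter txwait poll-interval 300 delta rising-threshold 70 event 4 falling-threshold 0 event 4
Ciscoswitch(config-port-monitor)# exit
Ciscoswitch(config)# port-monitor activate TxWait-5min

With a policy like this, an alert is raised only when TxWait accounts for at least 70% of a 300-second window, rather than on every 20-second sample.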

 
