WQE failure in EMS with extended status 0x16 frequently occurring

Category:
ontap-9
Specialty:
SAN

Applies to

  • NetApp FAS
  • NetApp AFF
  • Data ONTAP

Issue

  • WQE failure errors with Ext_Status 0x16, Ext_Status 0x1d, and Ext_Status 0x2 are logged against storage ports in the EMS log in ASUP across all the nodes in the cluster (a parsing sketch follows the log snippet below):
    • Ext_Status 0x16 means that the host initiator sent an abort to clear the current command queue. By itself this does not identify the issue or root cause; it is a symptom / side effect.
    • Ext_Status 0x1d indicates out-of-order frame delivery.
  • EMS log snippet for reference:

fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16

fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d

fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2
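
Where the EMS log contains a large number of these entries, a quick tally of the Ext_Status values per adapter shows which symptom dominates. The following is a minimal Python sketch, not a NetApp tool; the file name ems_log.txt is an assumed placeholder for EMS text extracted from the ASUP.

# Minimal sketch: count WQE-failure Ext_Status codes per adapter in an EMS log extract.
# "ems_log.txt" is an assumed placeholder file name.
import re
from collections import Counter

PATTERN = re.compile(
    r"STIO Adapter:(?P<adapter>\S+) IO WQE failure.*?Ext_Status (?P<ext>0x[0-9a-fA-F]+)"
)

counts = Counter()
with open("ems_log.txt") as fh:
    for line in fh:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("adapter"), match.group("ext"))] += 1

# Per this article: 0x16 = host-initiated abort, 0x1d = out-of-order frame delivery.
for (adapter, ext), total in counts.most_common():
    print(f"Adapter {adapter}: Ext_Status {ext} seen {total} times")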

  • The EMS log in ASUPs is filled with "found hung cmd" errors (a sketch for extracting these entries follows the example below).
    • A hung cmd with state=5 indicates that the FC target accepted the write request and is waiting for data to come back from the host, but nothing came back within the expected timeout.

Sat May 04 21:42:57 +0530 [Node1: fct_tpd_thread_1: fcp.io.status:debug]: STIO Adapter:11b, found hung cmd:0xfffff8182db43a60(state=5, flags=0x0, ctio_sent=1/1,RecvExAddr=0x2bee, OX_ID=0x2f7, RX_ID=0xffff,SID=0x51fxx, Cmd[8A], req_q_free:0)
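
To see which host initiators the hung commands are logged against (useful when narrowing the problem down to a single misbehaving end device), the state and SID fields can be pulled out of these entries. Another minimal sketch, again assuming the EMS text has been saved to ems_log.txt:

# Minimal sketch: tally "found hung cmd" entries by adapter, state, and source ID (SID).
import re
from collections import Counter

HUNG = re.compile(
    r"STIO Adapter:(?P<adapter>\S+), found hung cmd:\S+\(state=(?P<state>\d+)"
    r".*?SID=(?P<sid>[^,]+)"
)

hung = Counter()
with open("ems_log.txt") as fh:
    for line in fh:
        match = HUNG.search(line)
        if match:
            hung[(match.group("adapter"), match.group("state"), match.group("sid"))] += 1

# state=5 entries are the target still waiting on the host after accepting a write.
for (adapter, state, sid), total in hung.most_common():
    print(f"Adapter {adapter}: state={state} SID={sid} count={total}")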
 

Rx and Tx power readings on the affected link are within the recommended range:

Rx- 543.1 (uWatts)
Tx- 630.5 (uWatts)

  • On the switch end, porterrshow does not report any errors for the host-connected or storage-connected ports.

/fabos/cliexec/porterrshow :
          frames        enc     crc     crc     too     too     bad     enc    disc    link    loss    loss    frjt    fbsy     c3timeout     pcs      uncor
        tx       rx      in     err     g_eof   shrt    long    eof     out    c3      fail    sync    sig                      tx      rx      err     err
  26:    1.0g    2.1g    0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
  31:    1.9g  906.0m    0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

  • sfpshow on the switch end looks good for both ports, and both Tx and Rx power are within the recommended range.
  • On the switch end, errdump reports frame timeout events, which correspond to the STIO errors reported on the storage end at the same time.

Note: In the example below, the switch logs are in GMT and the storage logs are in GMT+5:30, so 19:38 GMT on March 16 corresponds to 01:08 (+0530) on March 17.

  • Switch end log snippet-
    • 2024/03/16-19:38:12, [AN-1014], 1865, FID 128, INFO, switch, Frame timeout detected, tx port 31 rx port 26, sid 51axx, did 51fxx, timestamp 2024-03-16 19:38:12
  • Storage end log snippet-
    • Sun Mar 17 01:08:17 +0530 [Storage-01: fct_tpd_work_thread_0: fcp.io.status:debug]: STIO Adapter:11b IO WQE failure, Handle 0x1, Type 8, S_ID: 51Fxx, VPI: 275, OX_ID: 44D, Status 0x3 Ext_Status 0x16
  • The frame timeout event identifies the port that received the frame (rx port) and the port out of which it could not be transmitted (tx port).
  • In the example above, port 31 (tx port) is where the frame timeouts are being reported on the switch (a timestamp-correlation sketch follows this list).
  • The Tx power looks good on both ends (storage as well as switch), so to isolate further, the steps below can be followed:
    • Start by moving the host cable on the switch end to a known-good switch port and check whether the issue follows.
    • If the issue follows, that confirms the switch SFP is good, as the Tx power also suggests.
    • After that, proceed with other hardware checks, such as the end-device SFP and the cable connecting the end device to the switch.
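
To line up the switch errdump event with the storage EMS entry across the two time zones, the timestamps can be converted to a common zone and compared. A minimal sketch using the values from the snippets above; the one-minute match window is an arbitrary choice:

# Minimal sketch: correlate a switch "Frame timeout" event (GMT in this example)
# with a storage STIO WQE-failure entry (GMT+5:30 in this example).
from datetime import datetime, timedelta, timezone

GMT = timezone.utc
IST = timezone(timedelta(hours=5, minutes=30))  # storage logs use +0530 in this example

switch_event = datetime.strptime("2024/03/16-19:38:12", "%Y/%m/%d-%H:%M:%S").replace(tzinfo=GMT)
storage_event = datetime.strptime("2024-03-17 01:08:17 +0530", "%Y-%m-%d %H:%M:%S %z")

print("Switch event in storage time zone:", switch_event.astimezone(IST))
delta = abs(storage_event - switch_event)
print("Gap between events:", delta)  # a few seconds apart -> same incident
print("Correlated" if delta <= timedelta(minutes=1) else "Not correlated")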

 

Along with the above points, the following issues might also be observed:

  • Active IQ Unified Manager Event: No Active Paths to Access LUN
    The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
  • No link-related issues are observed on any of the NetApp ports used by the affected igroup.
  • Performance issues combined with multiple entries in the EMS log.
  • VMs are in a not-responding or hung state, and end users are unable to access applications.
  • FRAME DROP events are observed in the vmkernel logs.

Frame_drop.png

  • Occurs on a validated Interoperability Matrix Tool (IMT) configuration.
  • Physical-layer issues are identified on the physical SAN switches attached to the ports reported in the EMS log.
  • A region of the fabric, or a host initiator, with errors and performance impact during the relevant timestamps has been identified.
  • Credit loss events are reported for host-connected ports in show logging log as well as show process creditmon credit-loss-events (a parsing sketch follows the command outputs below).

  • Ciscoswitch# show logging log :

2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .

  • Ciscoswitch# show process creditmon credit-loss-events

        Module: 01      Credit Loss Events: YES

----------------------------------------------------
| Interface |  Total |          Timestamp          |
|           | Events |                             |
----------------------------------------------------
| fc1/4     |    526 | 1. Fri Sep 29 12:21:34 2023 |
|           |        | 2. Fri Sep 29 11:21:30 2023 |
|           |        | 3. Fri Sep 29 09:46:23 2023 |
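
When the show logging log capture is long, the rising-threshold events can be tallied per interface to see which host-connected port is driving the TXWait, Timeout Discards, and Credit Loss Reco alerts. A minimal sketch, assuming the output has been saved to show_logging_log.txt:

# Minimal sketch: count PMON rising-threshold events per port and counter.
import re
from collections import Counter

RISING = re.compile(
    r"%PMON-\S+-RISING_THRESHOLD_REACHED: (?P<counter>.+?) has reached the rising "
    r"threshold \(port=(?P<port>\S+) "
)

events = Counter()
with open("show_logging_log.txt") as fh:  # assumed capture of "show logging log"
    for line in fh:
        match = RISING.search(line)
        if match:
            events[(match.group("port"), match.group("counter"))] += 1

for (port, counter), total in events.most_common():
    print(f"{port}: {counter} rising-threshold hits = {total}")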

 

  • Overutilization is observed on multiple F ports connected to the hosts.

Congestion Summary...
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                   |                                                           Tx Congestion problems                                                                       |
|                   +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
|                   |          Level 3         |          Level 2         | Level 1.5 TxWait >= 30%        | Level 1 TxWait < 30%           | Tx Util >= 80%                 |
|                   +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
|                   | Mode E | Mode F | Other  | Mode E | Mode F | Other  | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    |
| Switchname        | Ports  | Ports  | Ports  | Ports  | Ports  | Ports  | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    |
+ ------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1 | No     | Yes    | Yes    | No     | Yes    | Yes    | No       | Yes(63%) | No       | No       | Yes(29%) | No       | No       | Yes(86%) | No       |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2 | No     | Yes    | Yes    | No     | Yes    | Yes    | No       | Yes(61%) | No       | No       | Yes(29%) | No       | No       | Yes(89%) | No       |
+-------------------+--------+--------+--------+--------+--------+--------+--

 

  • TxWait congestion is reported on the switches on multiple port interfaces connected to the hosts.

     --------------------------------------------------------------------------------------------------------------
    | Slowdrain level 1.5 problems found: Yes                                                                      |
     --------------------------------------------------------------------------------------------------------------
                 -------------------------------------------O B F L  T x W a i t Congestion >= 30%-------------------------------------------
                | Interface  | Counter                                                 | Count        | Delta      | Timestamp               |
                 ----------------------------------------------------------------------------------------------------------------------------
                | fc10/46    | TxWait Congestion 41%                                   |         8sec |    3303328 | 2023/09/29 11:41:15     |
                | fc10/46    | TxWait Congestion 40%                                   |         8sec |    3226326 | 2023/09/29 11:41:35     |
                | fc10/46    | TxWait Congestion 39%                                   |         7sec |    3177619 | 2023/09/29 11:41:55     |
                | fc10/46    | TxWait Congestion 32%                                   |         6sec |    2565321 | 2023/09/29 15:32:56     |

  • TxWait means that the MDS has to wait to transmit frames because it is not receiving R_RDY signals from the connected end device; the end device needs to be checked, since it is the side not returning R_RDY signals to the MDS.
  • TxWait increments each time the local switch waits 2.5 microseconds for an R_RDY (B2B credit). The counter is collected every 20 seconds, and 30% or more within a 20-second window is considered a problem. Because alerting every 20 seconds is too noisy, and 20 seconds is too short a period to decide that a link is slow, the port-monitor is set up for 70% over 300 seconds (5 minutes), which is a more genuine alert to report to the service provider (a worked example follows the slow-drain output below).
  • The slow-drain logs show "LR Successful", which indicates that the end device / HBA / driver / firmware is the issue and the end device needs to be investigated.

Switchname: hyd02-mds-switch1
     --------------------------------------------------------------------------------------------------------------
    | Slowdrain level 3 problems found: Yes                                                                        |
     --------------------------------------------------------------------------------------------------------------
                 -------------------------------------------C r e d i t  L o s s  R e c o v e r y -------------------------------------------
                | Interface  | Counter                                                 | Count        | Delta      | Timestamp               |
                 ----------------------------------------------------------------------------------------------------------------------------
                | fc2/4      | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)             |          233 |          1 | 2023/09/29 04:02:11     |
                | fc2/4      | F32_TMM_PORT_TIMEOUT_DROP                               |        30167 |        128 |                         |
                | fc2/4      | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)             |          232 |          1 | 2023/09/29 04:00:30     |
                | fc2/4      | F32_TMM_PORT_TIMEOUT_DROP                               |        30039 |        257 |                         |
                | fc2/4      | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)             |          231 |          1 | 2023/09/29 03:40:07     |
                | fc2/4      | F32_TMM_PORT_TIMEOUT_DROP                               |        29782 |        166 |                         |
                | fc2/4      | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)             |          230 |          1 | 2023/09/29 03:38:06     |
                | fc2/4      | F32_TMM_PORT_TIMEOUT_DROP  
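
As a worked example of the TxWait arithmetic above: each TxWait tick represents 2.5 microseconds of waiting for an R_RDY, so the fc10/46 delta in the 20-second OBFL window works out to roughly 41%, matching the table. The sketch below (illustrative only) also shows how many ticks the suggested 70%-over-300-seconds port-monitor threshold corresponds to.

# Worked example of the TxWait percentage calculation described above.
TICK_SECONDS = 2.5e-6  # one TxWait tick = 2.5 microseconds of waiting for an R_RDY

def txwait_percent(delta_ticks: int, window_seconds: float) -> float:
    """Fraction of the polling window spent waiting to transmit, as a percentage."""
    return delta_ticks * TICK_SECONDS / window_seconds * 100

# Delta from the fc10/46 OBFL row above: 3,303,328 ticks in a 20-second window.
print(f"{txwait_percent(3_303_328, 20):.0f}%")  # ~41%, above the 30%-per-20s guideline

# The suggested port-monitor threshold: 70% TxWait over 300 seconds (5 minutes).
ticks_to_trip = 0.70 * 300 / TICK_SECONDS
print(f"{ticks_to_trip:,.0f} TxWait ticks over 5 minutes would trip that alert")  # 84,000,000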

 

