WQE failure in EMS with extended status 0x16 frequently occurring
Applies to
- NetApp FAS
- NetApp AFF
- Data ONTAP
Issue
WQE failure errors with Ext_Status 0x16, Ext_Status 0x1d, and Ext_Status 0x2 are logged against storage ports in the EMS log in ASUP across all the nodes in the cluster:
Ext_Status 0x16 means that the host initiator has sent an abort to clear the current command queue. This does not necessarily identify the issue or root cause; it is a symptom / side effect. Ext_Status 0x1d indicates out-of-order frame delivery.
- EMS log snippet for reference:
fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2
- The EMS log in ASUPs is filled with found hung cmd errors. A hung cmd with state=5 error indicates that the FC target is waiting for something to come back from the host after accepting the write request; however, nothing came back within the expected timeout value. (A parsing sketch to tally these events follows the example below.)
Sat May 04 21:42:57 +0530 [Node1: fct_tpd_thread_1: fcp.io.status:debug]: STIO Adapter:11b, found hung cmd:0xfffff8182db43a60(state=5, flags=0x0, ctio_sent=1/1,RecvExAddr=0x2bee, OX_ID=0x2f7, RX_ID=0xffff,SID=0x51fxx, Cmd[8A], req_q_free:0)
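To gauge how frequently the WQE failure and hung cmd events occur per adapter and per initiator, the EMS lines quoted above can be tallied with a small script. This is a minimal sketch only, assuming the EMS output has been exported to a plain-text file; the file name, regular expressions, and function names are illustrative and not part of any NetApp tool:

```python
import re
from collections import Counter

# Illustrative patterns built from the EMS lines quoted above (assumption:
# the EMS output was saved to a plain-text file, one event per line).
WQE_RE = re.compile(
    r"STIO Adapter:(?P<adapter>\S+?),? IO WQE failure.*?"
    r"S_ID:\s*(?P<sid>[^,\s]+).*?Ext_Status (?P<ext>0x[0-9a-fA-F]+)"
)
HUNG_RE = re.compile(r"STIO Adapter:(?P<adapter>[^,\s]+), found hung cmd")

# Meanings quoted from this article; codes not listed here are left unmapped.
EXT_STATUS_MEANING = {
    "0x16": "host initiator sent an abort to clear the command queue",
    "0x1d": "out-of-order frame delivery",
}

def summarize(path: str) -> None:
    """Count WQE failures (per adapter/S_ID/Ext_Status) and hung cmd events."""
    wqe, hung = Counter(), Counter()
    with open(path) as fh:
        for line in fh:
            m = WQE_RE.search(line)
            if m:
                wqe[(m["adapter"], m["sid"], m["ext"].lower())] += 1
                continue
            m = HUNG_RE.search(line)
            if m:
                hung[m["adapter"]] += 1
    for (adapter, sid, ext), count in wqe.most_common():
        note = EXT_STATUS_MEANING.get(ext, "see Ext_Status reference")
        print(f"adapter {adapter}  S_ID {sid}  Ext_Status {ext} ({note}): {count}")
    for adapter, count in hung.most_common():
        print(f"adapter {adapter}  found hung cmd events: {count}")

summarize("ems_log.txt")  # hypothetical text export of the EMS log
```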
- Both Rx and Tx power are in the recommended range on the storage end:
Rx: 543.1 (uWatts)
Tx: 630.5 (uWatts)
- On the switch end, porterrshow does not report any errors for the host or storage connected ports.
/fabos/cliexec/porterrshow :
frames enc crc crc too too bad enc disc link loss loss frjt fbsy c3timeout pcs uncor
tx rx in err g_eof shrt long eof out c3 fail sync sig tx rx err err
26: 1.0g 2.1g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
31: 1.9g 906.0m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- sfpshow on the switch end looks good for both ports, and both Tx and Rx power are in the recommended range as well.
- On the switch end, under errdump, frame timeout events are reported, which also correspond to the STIO errors reported on the storage end at the same time.
Note: In the example below, switch logs are in GMT and storage logs are in GMT+5:30, so 7:38 PM GMT on March 16th is 1:08 AM on March 17th (a small conversion sketch follows after the log snippets).
- Switch end log snippet:
2024/03/16-19:38:12, [AN-1014], 1865, FID 128, INFO, switch, Frame timeout detected, tx port 31 rx port 26, sid 51axx, did 51fxx, timestamp 2024-03-16 19:38:12
- Storage end log snippet:
Sun Mar 17 01:08:17 +0530 [Storage-01: fct_tpd_work_thread_0: fcp.io.status:debug]: STIO Adapter:11b IO WQE failure, Handle 0x1, Type 8, S_ID: 51Fxx, VPI: 275, OX_ID: 44D, Status 0x3 Ext_Status 0x16
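To line up switch events (GMT) with storage EMS events (GMT+5:30), the timestamps can be normalized to one time zone. A minimal sketch using only the Python standard library; the timestamp strings are copied from the snippets above, with the year added to the storage line since the EMS entry shows only the day and the +0530 offset:

```python
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))  # storage logs are GMT+5:30

# Switch frame timeout event (logged in GMT) from the errdump snippet above.
switch_event = datetime.strptime(
    "2024/03/16-19:38:12", "%Y/%m/%d-%H:%M:%S"
).replace(tzinfo=timezone.utc)

# Storage EMS event (logged in GMT+5:30) from the snippet above; year added.
storage_event = datetime.strptime(
    "Sun Mar 17 01:08:17 2024", "%a %b %d %H:%M:%S %Y"
).replace(tzinfo=IST)

print(switch_event.astimezone(IST))  # 2024-03-17 01:08:12+05:30
print(storage_event - switch_event)  # 0:00:05 -> the two events are ~5 seconds apart
```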
A frame timeout event states which port received a frame (Rx) and which port it could not be transmitted on (Tx).
- In the example above, the frame timeouts are being reported against port 31 (Tx) on the switch.
- The Tx power looks good on both ends, storage as well as switch, so to isolate the problem further the steps below can be followed:
- Start by moving the cable from the host-connected switch port to a known-good switch port and check whether the issue follows.
- If the issue follows, that proves the switch SFP is good, as the Tx power also suggests.
- After that, proceed with other hardware checks, such as the end-device SFP and the cable connecting the end device to the switch.
Along with the points above, the following issues might also be encountered or observed:
- Active IQ Unified Manager Event:
No Active Paths to Access LUN
The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
- No link-related issues are observed on any of the NetApp ports used by the affected igroup.
- Performance issues combined with multiple entries in the EMS log.
- VMs are in a not-responding or hung state, and end users are unable to access applications. FRAME DROP events have been observed in the vmkernel logs.
- Occurs on a validated Interoperability Matrix Tool configuration.
- Physical-layer issues identified in the EMS log on the physical SAN switches attached to the port.
- A region of the fabric, or a host initiator, with errors and performance impact during the timestamp has been identified.
- Credit loss events are reported for host-connected ports under show logging log as well as show process creditmon credit-loss-events.
- Ciscoswitch# show logging log:
2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .
Ciscoswitch# show process creditmon credit-loss-events
Module: 01 Credit Loss Events: YES
----------------------------------------------------
| Interface | Total | Timestamp |
| | Events | |
----------------------------------------------------
| fc1/4 | 526 | 1. Fri Sep 29 12:21:34 2023 |
| | | 2. Fri Sep 29 11:21:30 2023 |
| | | 3. Fri Sep 29 09:46:23 2023 |
- Overutilization is observed on multiple F ports connected to the host.
Congestion Summary...
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | Tx Congestion problems |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Level 3 | Level 2 | Level 1.5 TxWait >= 30% | Level 1 TxWait < 30% | Tx Util >= 80% |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other |
| Switchname | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports |
+ ------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1 | No | Yes | Yes | No | Yes | Yes | No | Yes(63%) | No | No | Yes(29%) | No | No | Yes(86%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2 | No | Yes | Yes | No | Yes | Yes | No | Yes(61%) | No | No | Yes(29%) | No | No | Yes(89%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+--
- TxWait congestion is reported on the switches on multiple port interfaces connected to the hosts.
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 1.5 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------O B F L T x W a i t Congestion >= 30%-------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc10/46 | TxWait Congestion 41% | 8sec | 3303328 | 2023/09/29 11:41:15 |
| fc10/46 | TxWait Congestion 40% | 8sec | 3226326 | 2023/09/29 11:41:35 |
| fc10/46 | TxWait Congestion 39% | 7sec | 3177619 | 2023/09/29 11:41:55 |
| fc10/46 | TxWait Congestion 32% | 6sec | 2565321 | 2023/09/29 15:32:56 |
TxWait means that the MDS has to wait to transmit frames because it is not receiving R_RDY signals from the connected end device; the end device should therefore be checked, since it is not returning R_RDY signals to the MDS. TxWait increments each time the local switch waits 2.5 microseconds for an R_RDY (B2B credit). The counter is collected every 20 seconds, and 30% or more within a 20-second window is considered bad. To avoid alerting every 20 seconds, which is too short a period to decide that a link is slow, the port-monitor can instead be set to 70% over 300 seconds (5 minutes), which produces a more meaningful alert to report to the service provider. A small calculation sketch follows below.
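To make these percentages concrete, the TxWait counter delta can be converted to waited time and compared against a threshold. This is a minimal sketch of the arithmetic described above; the function and variable names are illustrative, and the sample delta is taken from the OBFL table earlier in this section:

```python
TXWAIT_TICK_SECONDS = 2.5e-6  # each TxWait increment = 2.5 microseconds of waiting

def txwait_percent(delta: int, interval_seconds: float) -> float:
    """Percentage of the polling interval spent waiting for B2B credits."""
    return delta * TXWAIT_TICK_SECONDS / interval_seconds * 100

# Sample from the OBFL table above: delta 3,303,328 over a 20-second poll.
print(round(txwait_percent(3_303_328, 20)))  # ~41, matching "TxWait Congestion 41%"

# Counter deltas implied by the two thresholds discussed above.
print(0.30 * 20 / TXWAIT_TICK_SECONDS)   # 2,400,000 ticks = 30% of a 20-second window
print(0.70 * 300 / TXWAIT_TICK_SECONDS)  # 84,000,000 ticks = 70% of a 300-second window
```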
Slowdrain logs show LR successful; this clearly means that the end device / HBA / driver / firmware is the issue, and the end device needs to be investigated.
Switchname: hyd02-mds-switch1
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 3 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------C r e d i t L o s s R e c o v e r y -------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 233 | 1 | 2023/09/29 04:02:11 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30167 | 128 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 232 | 1 | 2023/09/29 04:00:30 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30039 | 257 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 231 | 1 | 2023/09/29 03:40:07 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 29782 | 166 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 230 | 1 | 2023/09/29 03:38:06 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP