WQE failure in EMS with extended status 0x16 frequently occurring
Applies to
- NetApp FAS
- NetApp AFF
- Data ONTAP
Issue
WQE failure errors with Ext_Status 0x16, Ext_Status 0x1d, and Ext_Status 0x2 are logged against storage ports in the EMS log in ASUP across all the nodes in the cluster:
Ext_Status 0x16 means that the host initiator has sent an abort to clear the current command queue. This does not necessarily identify the issue or root cause; it is a symptom / side effect. Ext_Status 0x1d indicates out-of-order frame delivery.
- EMS log snippet for reference:
fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2
- The EMS log in ASUPs is filled with found hung cmd errors. A hung cmd with state=5 error indicates that the FC target is waiting for something to come back from the host after accepting the write request; however, nothing came back within the expected timeout value. (A parsing sketch to tally these events follows the example below.)
Sat May 04 21:42:57 +0530 [Node1: fct_tpd_thread_1: fcp.io.status:debug]: STIO Adapter:11b, found hung cmd:0xfffff8182db43a60(state=5, flags=0x0, ctio_sent=1/1,RecvExAddr=0x2bee, OX_ID=0x2f7, RX_ID=0xffff,SID=0x51fxx, Cmd[8A], req_q_free:0)
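To gauge how frequently the WQE failure and hung cmd events occur per adapter and per initiator, the EMS lines quoted above can be tallied with a small script. This is a minimal sketch only, assuming the EMS output has been exported to a plain-text file; the file name, regular expressions, and function names are illustrative and not part of any NetApp tool:

```python
import re
from collections import Counter

# Illustrative patterns built from the EMS lines quoted above (assumption:
# the EMS output was saved to a plain-text file, one event per line).
WQE_RE = re.compile(
    r"STIO Adapter:(?P<adapter>\S+?),? IO WQE failure.*?"
    r"S_ID:\s*(?P<sid>[^,\s]+).*?Ext_Status (?P<ext>0x[0-9a-fA-F]+)"
)
HUNG_RE = re.compile(r"STIO Adapter:(?P<adapter>[^,\s]+), found hung cmd")

# Meanings quoted from this article; codes not listed here are left unmapped.
EXT_STATUS_MEANING = {
    "0x16": "host initiator sent an abort to clear the command queue",
    "0x1d": "out-of-order frame delivery",
}

def summarize(path: str) -> None:
    """Count WQE failures (per adapter/S_ID/Ext_Status) and hung cmd events."""
    wqe, hung = Counter(), Counter()
    with open(path) as fh:
        for line in fh:
            m = WQE_RE.search(line)
            if m:
                wqe[(m["adapter"], m["sid"], m["ext"].lower())] += 1
                continue
            m = HUNG_RE.search(line)
            if m:
                hung[m["adapter"]] += 1
    for (adapter, sid, ext), count in wqe.most_common():
        note = EXT_STATUS_MEANING.get(ext, "see Ext_Status reference")
        print(f"adapter {adapter}  S_ID {sid}  Ext_Status {ext} ({note}): {count}")
    for adapter, count in hung.most_common():
        print(f"adapter {adapter}  found hung cmd events: {count}")

summarize("ems_log.txt")  # hypothetical text export of the EMS log
```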
- Both Rx and Tx power are in the recommended range on the storage end:
Rx: 543.1 (uWatts)
Tx: 630.5 (uWatts)
- On the switch end, porterrshow does not report any errors for the host or storage connected ports.
/fabos/cliexec/porterrshow :
frames enc crc crc too too bad enc disc link loss loss frjt fbsy c3timeout pcs uncor
tx rx in err g_eof shrt long eof out c3 fail sync sig tx rx err err
26: 1.0g 2.1g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
31: 1.9g 906.0m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- sfpshow on the switch end looks good for both ports, and both Tx and Rx power are in the recommended range as well.
- On the switch end, under errdump, frame timeout events are reported, which also correspond to the STIO errors reported on the storage end at the same time.
Note: In the example below, switch logs are in GMT and storage logs are in GMT+5:30, so 7:38 PM GMT on March 16th is 1:08 AM on March 17th (a small conversion sketch follows after the log snippets).
- Switch end log snippet:
2024/03/16-19:38:12, [AN-1014], 1865, FID 128, INFO, switch, Frame timeout detected, tx port 31 rx port 26, sid 51axx, did 51fxx, timestamp 2024-03-16 19:38:12
- Storage end log snippet:
Sun Mar 17 01:08:17 +0530 [Storage-01: fct_tpd_work_thread_0: fcp.io.status:debug]: STIO Adapter:11b IO WQE failure, Handle 0x1, Type 8, S_ID: 51Fxx, VPI: 275, OX_ID: 44D, Status 0x3 Ext_Status 0x16
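To line up switch events (GMT) with storage EMS events (GMT+5:30), the timestamps can be normalized to one time zone. A minimal sketch using only the Python standard library; the timestamp strings are copied from the snippets above, with the year added to the storage line since the EMS entry shows only the day and the +0530 offset:

```python
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))  # storage logs are GMT+5:30

# Switch frame timeout event (logged in GMT) from the errdump snippet above.
switch_event = datetime.strptime(
    "2024/03/16-19:38:12", "%Y/%m/%d-%H:%M:%S"
).replace(tzinfo=timezone.utc)

# Storage EMS event (logged in GMT+5:30) from the snippet above; year added.
storage_event = datetime.strptime(
    "Sun Mar 17 01:08:17 2024", "%a %b %d %H:%M:%S %Y"
).replace(tzinfo=IST)

print(switch_event.astimezone(IST))  # 2024-03-17 01:08:12+05:30
print(storage_event - switch_event)  # 0:00:05 -> the two events are ~5 seconds apart
```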
A frame timeout event states which port received a frame (Rx) and which port it could not be transmitted on (Tx).
- In the example above, the frame timeouts are being reported against port 31 (Tx) on the switch.
- The Tx power looks good on both ends, storage as well as switch, so to isolate the problem further the steps below can be followed:
- Start by moving the cable from the host-connected switch port to a known-good switch port and check whether the issue follows.
- If the issue follows, that proves the switch SFP is good, as the Tx power also suggests.
- After that, proceed with other hardware checks, such as the end-device SFP and the cable connecting the end device to the switch.
Along with the points above, the following issues might also be encountered or observed:
- Active IQ Unified Manager Event:
No Active Paths to Access LUN
The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
- No link-related issues are observed on any of the NetApp ports used by the affected igroup.
- Performance issues combined with multiple entries in the EMS log.
- VMs are in a not-responding or hung state, and end users are unable to access applications. FRAME DROP events have been observed in the vmkernel logs.
- Occurs on a validated Interoperability Matrix Tool configuration.
- Physical-layer issues identified in the EMS log on the physical SAN switches attached to the port.
- A region of the fabric, or a host initiator, with errors and performance impact during the timestamp has been identified.
- Credit loss events are reported for host-connected ports under show logging log as well as show process creditmon credit-loss-events.
- Ciscoswitch# show logging log:
2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .
Ciscoswitch# show process creditmon credit-loss-events
Module: 01 Credit Loss Events: YES
----------------------------------------------------
| Interface | Total | Timestamp |
| | Events | |
----------------------------------------------------
| fc1/4 | 526 | 1. Fri Sep 29 12:21:34 2023 |
| | | 2. Fri Sep 29 11:21:30 2023 |
| | | 3. Fri Sep 29 09:46:23 2023 |
- Overutilization is observed on multiple F ports connected to the host.
Congestion Summary...
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | Tx Congestion problems |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Level 3 | Level 2 | Level 1.5 TxWait >= 30% | Level 1 TxWait < 30% | Tx Util >= 80% |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other |
| Switchname | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports |
+ ------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1 | No | Yes | Yes | No | Yes | Yes | No | Yes(63%) | No | No | Yes(29%) | No | No | Yes(86%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2 | No | Yes | Yes | No | Yes | Yes | No | Yes(61%) | No | No | Yes(29%) | No | No | Yes(89%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+--
- TxWait congestion is reported on the switches on multiple port interfaces connected to the hosts.
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 1.5 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------O B F L T x W a i t Congestion >= 30%-------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc10/46 | TxWait Congestion 41% | 8sec | 3303328 | 2023/09/29 11:41:15 |
| fc10/46 | TxWait Congestion 40% | 8sec | 3226326 | 2023/09/29 11:41:35 |
| fc10/46 | TxWait Congestion 39% | 7sec | 3177619 | 2023/09/29 11:41:55 |
| fc10/46 | TxWait Congestion 32% | 6sec | 2565321 | 2023/09/29 15:32:56 |
TxWait means that the MDS has to wait to transmit frames because it is not receiving R_RDY signals from the connected end device; the end device should therefore be checked, since it is not returning R_RDY signals to the MDS. TxWait increments each time the local switch waits 2.5 microseconds for an R_RDY (B2B credit). The counter is collected every 20 seconds, and 30% or more within a 20-second window is considered bad. To avoid alerting every 20 seconds, which is too short a period to decide that a link is slow, the port-monitor can instead be set to 70% over 300 seconds (5 minutes), which produces a more meaningful alert to report to the service provider. A small calculation sketch follows below.
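To make these percentages concrete, the TxWait counter delta can be converted to waited time and compared against a threshold. This is a minimal sketch of the arithmetic described above; the function and variable names are illustrative, and the sample delta is taken from the OBFL table earlier in this section:

```python
TXWAIT_TICK_SECONDS = 2.5e-6  # each TxWait increment = 2.5 microseconds of waiting

def txwait_percent(delta: int, interval_seconds: float) -> float:
    """Percentage of the polling interval spent waiting for B2B credits."""
    return delta * TXWAIT_TICK_SECONDS / interval_seconds * 100

# Sample from the OBFL table above: delta 3,303,328 over a 20-second poll.
print(round(txwait_percent(3_303_328, 20)))  # ~41, matching "TxWait Congestion 41%"

# Counter deltas implied by the two thresholds discussed above.
print(0.30 * 20 / TXWAIT_TICK_SECONDS)   # 2,400,000 ticks = 30% of a 20-second window
print(0.70 * 300 / TXWAIT_TICK_SECONDS)  # 84,000,000 ticks = 70% of a 300-second window
```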
Slowdrain logs show LR successful; this clearly means that the end device / HBA / driver / firmware is the issue, and the end device needs to be investigated.
Switchname: hyd02-mds-switch1
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 3 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------C r e d i t L o s s R e c o v e r y -------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 233 | 1 | 2023/09/29 04:02:11 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30167 | 128 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 232 | 1 | 2023/09/29 04:00:30 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30039 | 257 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 231 | 1 | 2023/09/29 03:40:07 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 29782 | 166 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 230 | 1 | 2023/09/29 03:38:06 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP