How to troubleshoot frontend SAN issues using the fcHosts 3 command (part of supportdata collection)

Category:: fc-series

Specialty:: esg

Applies to

Fibre Channel

Description

Introduction

This article describes how the fcHosts 3 shell command can be used to find the bad component in a frontend SAN if too many bad FC frames are received. In such a case, the controller logs one of the following events in the Major Event Log (MEL):

0x1207 Fibre channel link errors - threshold exceeded or 0x1206 Fibre channel link errors continue

The thresholds are defined in NVSRAM at offset 0x38. As a rule of thumb, if one of the events mentioned above is seen in the Event Log, the user should also see an impact on the affected server(s), and the issue should therefore be investigated.

Overview

The fcHosts 3 shell command is old but still useful. It is part of the supportdata collection (captured in statecapturedata.txt) if an FC host connection is found.
The output displays the history of the communication between the FC HBA and the controller port the HBA is logged in to.
The maximum number of events that are listed is 50.
A drawback of the command is that the output lists only the time and not the date, so the events could have occurred days ago. In case of an issue, however, many events are usually logged within a short period, and most of the time they are from the day the support data was captured.

Example:

The following is an example of the information provided:

Executing fcHosts(3,0,0,0,0,0,0,0,0,0) on controller A:

<-snip->

=============== HOST 10 =====================

Hst-Role(Ch) PortId PortWwn           NodeWwn           DstNPort CmdRecv Label
10-Host( 2)  0e0000 10008c7c-ff2057ba 20008c7c-ff2057ba 0eaabf80 195719  SRV-MDC2-HBA-P0

PERMITS: 0x00000008 HsdPort FLAGS: 0x00001406 Plogi Prli LoginRcvd Analyze LastActivity: 11/19/13-17:15:47 (GMT)

HOST LOG==> logCtl:0eab1540 logIndex: 5 goodIoCount:88836 dstNPort:0eaabf80 maxIndex: 50 logIoCount:1

                                        RepeatCounts -- IO Types
                                   (R=read,W=write,O=other,N=nonScsi)
Num Time       LogCode  Qualifier   LogCode     GoodIo      Outstand
                                    Cnt Type    Cnt Type    Cnt Type
  1 11:58:40     First   00000000     1 ----  >100K RWO-      1 ----
  2 13:45:47  RscnRecv   000e0000     1 ----      1 ----      1 ----
  3 13:45:47    Logout    RscnMis     1 ----      1 ----      1 ----
  4 13:49:51     Login   ff2057ba     1 ----   <100 R---      1 ----
  5 13:49:51   ChkCond   06290400   <10 R---   <100 R---      1 R---
  6 13:49:51  RscnRecv   000e0000     1 ----  <100K RWO-      1 ----

Explanation:

10-Host : 10 is the same as the ITN number in the tditnall output of the same controller
Ch(X) : 2 is the channel (host port) the HBA is logged in to. Use fcAll/chall to find the host port
PortId : 0e0000 is the 24-bit address of the switch port the HBA is connected to
PortWwn : 10008c7c-ff2057ba is the FC Port WWN of the HBA
NodeWwn : 20008c7c-ff2057ba is the FC Node WWN of the HBA
CmdRecv : 195719 is the number of SCSI commands received from this HBA
Label : SRV-MDC2-HBA-P0 is the alias of the HBA defined in SANtricity
Time : The time when the event happened (there is no date)
LogCode : The event that happened
Qualifier : A qualifier of the event; for a check condition (ChkCond), it is the sense data
LogCodeCnt: Number of consecutive occurrences of this LogCode event
GoodIo Cnt: Number of IOs returned with good status after the first occurrence
Outst. Cnt: Number of outstanding IOs when the first occurrence was logged
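
When many hosts have to be checked, it can help to split the host log rows into the fields explained above. The following Python sketch is not part of the controller firmware or of any support tool. The names `HostLogEntry` and `parse_host_log_row` are hypothetical, and the sketch assumes that each row consists of exactly ten whitespace-separated fields, as in the example output. The counters are kept as strings because the log prints them as ranges such as `<100` or `>100K`.

```python
from typing import NamedTuple


class HostLogEntry(NamedTuple):
    """One row of the HOST LOG table (hypothetical helper type)."""
    num: int
    time: str               # time of day only; the log carries no date
    log_code: str
    qualifier: str
    log_code_cnt: str       # bucketed counters stay strings, e.g. "<10"
    log_code_type: str
    good_io_cnt: str
    good_io_type: str
    outstanding_cnt: str
    outstanding_type: str


def parse_host_log_row(line: str) -> HostLogEntry:
    """Split one whitespace-separated host log row into named fields."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    return HostLogEntry(int(fields[0]), *fields[1:])


if __name__ == "__main__":
    row = "  5 13:49:51   ChkCond   06290400   <10 R---   <100 R---      1 R---"
    entry = parse_host_log_row(row)
    print(entry.time, entry.log_code, entry.qualifier)
```

Header lines and the `=== HOST n ===` separators do not have ten fields, so the function raises `ValueError` for them; a caller would have to skip those lines.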

The following is seen from the example above:

The HBA with WWPN 10008c7c-ff2057ba is connected to the FC switch port 0x0e0000 and is logged in to the controller through channel 2 (use fcAll/chall to find the real host port). In the host mapping section of SANtricity, the user has given the HBA the alias SRV-MDC2-HBA-P0. The history shows that, from the start of the capture (First), the HBA sent multiple IOs without issues, and then sent a Rescan followed by an FC port Logout and Login. The controller confirmed the Logout/Login by returning a check condition with sense key 06, ASC 29, ASCQ 04, which decodes to "Device Internal Reset". The HBA then sent another Rescan. Overall, there is no indication of a communication issue between the HBA and the controller. A few Logins/Logouts are usually not an issue.
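
The way a ChkCond qualifier is read in this article (first byte is the sense key, second the ASC, third the ASCQ) can be sketched in Python as follows. The function names are hypothetical, and the lookup table contains only the two decodings quoted in this article, not a complete SCSI sense table. The meaning of the fourth byte is not described here, so the sketch ignores it.

```python
# Only the two decodings quoted in this article; NOT a complete SCSI table.
KNOWN_SENSE = {
    (0x06, 0x29, 0x04): "Device Internal Reset",
    (0x0B, 0x47, 0x00): "SCSI Parity Error",
}


def split_sense(qualifier: str) -> tuple:
    """Split an 8-hex-digit ChkCond qualifier into (sense key, ASC, ASCQ).

    Assumption: the fourth byte is ignored because the article does not
    explain it. Non-hex qualifiers such as "ReplyTO" raise ValueError.
    """
    raw = bytes.fromhex(qualifier)
    if len(raw) != 4:
        raise ValueError(f"expected 8 hex digits, got {qualifier!r}")
    return raw[0], raw[1], raw[2]


def describe_sense(qualifier: str) -> str:
    """Return 'key/asc/ascq: text' for a ChkCond qualifier."""
    key, asc, ascq = split_sense(qualifier)
    text = KNOWN_SENSE.get((key, asc, ascq), "not listed in this article")
    return f"{key:02x}/{asc:02x}/{ascq:02x}: {text}"


if __name__ == "__main__":
    print(describe_sense("06290400"))
```

For the qualifier `06290400` from the example, this prints `06/29/04: Device Internal Reset`.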

List of LogCodes and Qualifiers

LogCodes:

AbtsRecv = Session Abort received (is an indication of a path issue)
ChkCond = Controller sent a SCSI check condition (sense data) to the HBA (see Qualifier for details)
First = Start of capture
GoodIo = HBA sent good IO
Login = HBA did a Port Login into the controller
Logout = HBA logged out of the controller
LinkDown = Connection to the HBA is down
Qfull = Queue full condition met
ResvCon = Controller returned a reservation conflict to the HBA (could be normal in a cluster configuration!)
RscnRecv = HBA sent a Rescan
ScsiStat = Other SCSI status occurred

Qualifiers (most common only)

Count = Count (low-level FC error; indication of a path issue)
Discnct = Disconnect
FreezeTO = Freeze Timeout
Logo = Logout
Observed = Event observed
ReplyTO = Reply Timeout (indication of a path issue)
RscnMis = Rescan device missing

The following are two examples of an HBA having issues talking to the controller:

                                        RepeatCounts -- IO Types
                                   (R=read,W=write,O=other,N=nonScsi)
Num Time       LogCode  Qualifier   LogCode     GoodIo      Outstand  
                                    Cnt Type    Cnt Type    Cnt Type
 42 15:20:40    GoodIo   00000000     1 ----   <100 -W--      1 -W--
 43 15:21:16  SetError    ReplyTO     1 -W--   <100 -W--      1 -W-- 
 44 15:21:21   ChkCond   0b470000    <5 -W--   <100 -W--     <5 -W-- 
 45 15:21:26  SetError    ReplyTO     1 -W--    <10 RW--      1 -W-- 
 46 15:21:28   ChkCond   0b470000    <5 -W--   <100 -W--      1 -W-- 
 47 15:22:36  AbtsRecv   00000000     1 -W--    <1K RW--      1 -W-- 
 48 15:22:42   ChkCond   0b470000   <10 -W--    <1K RW--      1 -W-- 
 49 15:23:11    GoodIo   00000000     1 ----    <10 -W--      1 -W-- 
 50 15:23:12   ChkCond   0b470000   <10 -W--    <1K RW--      1 -W--

The sense data 0b/47/00 decodes to "SCSI Parity Error".
In an FC network, this means that the controller received an FC frame with an incorrect CRC.

and:

 42 17:13:11  SetError    ReplyTO    <5 -W--    <1K -W--     <5 -W-- 
 43 17:13:14    GoodIo   00000000     1 ----    <1K -W--    <10 -W-- 
 44 17:13:14  SetError    ReplyTO    <5 -W--    <1K -W--     <5 -W-- 
 45 17:13:16    GoodIo   00000000     1 ----    <1K -W--     <5 -W-- 
 46 17:13:19  SetError    ReplyTO     1 -W--  <100K RW--      1 -W-- 
 47 17:13:55  SetError      Count    <5 -W--    <1K -W--      1 -W-- 
 48 17:13:56    GoodIo   00000000     1 ----    <1K -W--     <5 -W-- 
 49 17:14:05  SetError    ReplyTO    <5 -W--   <100 R---     <5 -W-- 
 50 17:14:12    GoodIo   00000000     1 ----  <100K RW--      1 R---

Note: If an HBA shows a lot of the above errors, it does NOT automatically mean that the HBA is faulty. It means that FC frames are corrupted or dropped somewhere between this HBA and the controller. A bad HBA is just one possible cause of the issue.
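
To compare several HBAs quickly, the log entries that this article marks as indications of a path issue (the AbtsRecv LogCode, the ReplyTO and Count qualifiers, and ChkCond with sense 0b/47/00) can be tallied per host. The following Python sketch is a hypothetical helper, not an existing tool. It assumes whitespace-separated rows with the LogCode in the third and the Qualifier in the fourth column, and it counts log entries, not individual events, because each entry only carries a bucketed repeat count.

```python
from collections import Counter

# Indicators taken from the LogCode and Qualifier lists in this article.
PATH_ISSUE_LOGCODES = {"AbtsRecv"}
PATH_ISSUE_QUALIFIERS = {"ReplyTO", "Count"}
CRC_SENSE_PREFIX = "0b4700"  # sense 0b/47/00, "SCSI Parity Error"


def count_path_issue_indicators(rows):
    """Count log entries (not events) that hint at a path issue."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip header lines and blank lines
        log_code, qualifier = fields[2], fields[3]
        if log_code in PATH_ISSUE_LOGCODES:
            counts[log_code] += 1
        elif qualifier in PATH_ISSUE_QUALIFIERS:
            counts[qualifier] += 1
        elif log_code == "ChkCond" and qualifier.lower().startswith(CRC_SENSE_PREFIX):
            counts["ChkCond 0b/47/00"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        " 46 17:13:19  SetError    ReplyTO     1 -W--  <100K RW--      1 -W--",
        " 47 17:13:55  SetError      Count    <5 -W--    <1K -W--      1 -W--",
        " 48 17:13:56    GoodIo   00000000     1 ----    <1K -W--     <5 -W--",
    ]
    for name, n in count_path_issue_indicators(sample).items():
        print(name, n)
```

A high tally for one HBA only narrows down the affected path; as noted above, it does not identify the faulty component.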