NetApp Knowledge Base

Multiple ESXi hosts on Cisco UCS lost datastore connectivity

Category:
fabric-interconnect-and-management-switches
Specialty:
san

Applies to

  • VMware ESXi
  • Cisco UCS-FI-6332-16UP >= 4.0
  • Brocade FabricOS (FOS)

Issue

  • On ESXi, vobd.log reports LUN resets, path loss, All Paths Down (APD), and Permanent Device Loss (PDL) events against multiple (or all) datastores:

    [...]
    2021-11-02T12:54:31.826Z: [scsiCorrelator] 1068996446us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba1:C0:T25:L10 changed state from on
    2021-11-02T12:54:31.830Z: [scsiCorrelator] 1069000506us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba1:C0:T25:L4 changed state from on
    2021-11-02T12:57:17.144Z: [scsiCorrelator] 1234311968us: [vob.scsi.scsipath.por] Power-on Reset occurred on naa.600a09803830376643xxxx
    2021-11-02T12:58:53.367Z: [scsiCorrelator] 1330557897us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.600a09803830376643xxxxx. Path vmhba0:C0:T21:L4 is down. Affected datastores: Unknown.
    2021-11-02T13:01:13.367Z: [APDCorrelator] 1470530196us: [vob.storage.apd.timeout] Device or filesystem with identifier [naa.600a0980383037654xxxxx] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
    2021-11-02T13:06:13.119Z: [APDCorrelator] 1770276511us: [vob.storage.apd.start] Device or filesystem with identifier [naa.600a09803830376643xxxx] has entered the All Paths Down state.
    2021-11-02T13:06:13.119Z: [APDCorrelator] 1770309617us: [esx.problem.storage.apd.start] Device or filesystem with identifier [naa.600a09803830376643xxxx] has entered the All Paths Down state.
    2021-11-02T13:06:13.120Z: [scsiCorrelator] 1770310964us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.600a09803830376643xxxxx. Path vmhba0:C0:T27:L8 is down. Affected datastores: Unknown.
    [...]
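
When many datastores are affected, it can help to tally the vobd.log events by event ID and by HBA. The following is a minimal Python sketch, assuming the line layout seen in the excerpt above (timestamp, correlator, uptime in microseconds, event ID, message). The names `VOBD_LINE`, `SCSI_PATH`, and `summarize_vobd` are illustrative and not part of any VMware tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# <timestamp>: [<correlator>] <uptime>us: [<event id>] <message>
VOBD_LINE = re.compile(
    r"^(?P<ts>\S+): \[(?P<correlator>\w+)\] (?P<uptime_us>\d+)us: "
    r"\[(?P<event>[\w.]+)\] (?P<message>.*)$"
)
# Runtime path names such as vmhba1:C0:T25:L10
SCSI_PATH = re.compile(r"vmhba\d+:C\d+:T\d+:L\d+")


def summarize_vobd(lines):
    """Count event IDs and the HBAs named in path-related messages."""
    events, hbas = Counter(), Counter()
    for line in lines:
        match = VOBD_LINE.match(line.strip())
        if not match:
            continue  # skip elided or unrelated lines
        events[match.group("event")] += 1
        path = SCSI_PATH.search(match.group("message"))
        if path:
            hbas[path.group(0).split(":")[0]] += 1
    return events, hbas


sample = [
    "2021-11-02T12:54:31.826Z: [scsiCorrelator] 1068996446us: "
    "[vob.scsi.scsipath.pathstate.dead] scsiPath vmhba1:C0:T25:L10 changed state from on",
    "2021-11-02T12:58:53.367Z: [scsiCorrelator] 1330557897us: "
    "[esx.problem.storage.connectivity.lost] Lost connectivity to storage device "
    "naa.600a09803830376643xxxxx. Path vmhba0:C0:T21:L4 is down. Affected datastores: Unknown.",
    "2021-11-02T13:06:13.119Z: [APDCorrelator] 1770276511us: "
    "[vob.storage.apd.start] Device or filesystem with identifier "
    "[naa.600a09803830376643xxxx] has entered the All Paths Down state.",
    "[...]",
]

events, hbas = summarize_vobd(sample)
for event, count in events.most_common():
    print(event, count)
for hba, count in hbas.most_common():
    print(hba, count)
```

If events show up on every HBA, as in the excerpt, that is consistent with a problem that is not confined to a single path.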

  • On FOS, porterrshow shows a significant increase in the c3timeout error counters compared to the other counters. The c3timeout rx counts are a consequence of the c3timeout tx errors.

    porterrshow        :
              frames     [...]  disc   link   loss   loss   frjt   fbsy  c3timeout    pcs    uncor
           tx     rx     [...]  c3    fail    sync   sig                  tx    rx     err    err
      0:    4.0g 739.0m  [...]  42.3k   0      0      0      0      0      0     42.0k   0      0
      1:  327.2m   2.3g  [...]  16.2k   0      0      0      0      0      0     16.1k   0      0
     22:   13.6k  25.6k  [...]  58      0      0      0      0      0      0     58      0      0
     23:   12.9k  24.8k  [...]   6      0      0      0      0      0      0      6      0      0
     28:  189.4m   8.7m  [...]   1.1k 750      0      1      0      0      1.1k   5      0      0
     29:   87.8m   6.5m  [...]   1.2k 750      0      1      0      0      1.2k   3      0      0
     30:   33.1m   6.0m  [...]   3.0k 750      0      1      0      0      3.0k   1      1      0
     31:   98.8m  21.1m  [...]   1.3k 750      0      1      0      0      1.3k   1      1      0
     32:    1.0g   1.1g  [...]   1.6k   0      0      0      0      0      0      1.6k   0      0
     33:    1.0g   1.1g  [...] 950      0      0      0      0      0      0    947      0      0
     34:  530.7m   1.7g  [...]   2.0k   0      0      0      0      0      0      2.0k   0      0
     35:  549.5m   2.2g  [...]   1.7k   0      0      0      0      0      0      1.7k   0      0

    • Use nsshow to identify the devices connected to the ports with c3timeout tx errors, then investigate the affected links further.

      N    0c1c00;      3;pn;pn 0x00000000
      SCR: None
      IP address: ip
      PortSymb: [22] "UCSname-A:fc1/1."
      NodeSymb: [15] "UCSname-A"
      Fabric Port Name: pn
      Permanent Port Name: pn
      Device type: Physical Unknown(initiator/target)
      Port Index: 28
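
To pick out the ports with c3timeout tx errors from a longer porterrshow capture, the counters can be parsed programmatically. The sketch below is a rough Python illustration under two assumptions that are not stated in the article: the last four columns of each row are c3timeout tx, c3timeout rx, pcs err, and uncor err (as in the header shown above), and the k/m/g suffixes are decimal multipliers. The helper names `parse_count` and `c3_timeouts` are hypothetical.

```python
import re

# Assumption: FOS abbreviates large counters with decimal suffixes.
SCALE = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
# A data row starts with the port index followed by a colon.
ROW = re.compile(r"^\s*(\d+):\s+(.+)$")


def parse_count(text):
    """Convert a counter such as '42.3k' or '750' to an integer."""
    suffix = text[-1].lower()
    if suffix in SCALE:
        return round(float(text[:-1]) * SCALE[suffix])
    return int(text)


def c3_timeouts(output):
    """Map port index -> (c3timeout tx, c3timeout rx).

    Columns are read from the right-hand side, so elided or extra
    columns in the middle of the row do not matter.
    """
    result = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header or separator line
        fields = match.group(2).split()
        if len(fields) < 4:
            continue
        result[int(match.group(1))] = (
            parse_count(fields[-4]),
            parse_count(fields[-3]),
        )
    return result


sample = """\
  0:    4.0g 739.0m  [...]  42.3k   0      0      0      0      0      0     42.0k   0      0
 28:  189.4m   8.7m  [...]   1.1k 750      0      1      0      0      1.1k   5      0      0
 33:    1.0g   1.1g  [...] 950      0      0      0      0      0      0    947      0      0
"""

counters = c3_timeouts(sample)
tx_ports = sorted(port for port, (tx, _rx) in counters.items() if tx > 0)
print(tx_ports)
```

The ports returned in `tx_ports` are the ones whose attached devices are worth looking up in nsshow by Port Index.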

  • On FOS, errdump -a shows corresponding frame timeouts and link resets:

    2021/11/02-13:01:36, [C3-1014], 62547, CHASSIS, WARNING, DS_6510B,  Link Reset on Port S0,P31(45) vc_no=0 crd(s)lost=64 auto trigger.
    2021/11/02-13:01:37, [C3-1014], 62548, CHASSIS, WARNING, DS_6510B,  Link Reset on Port S0,P28(40) vc_no=0 crd(s)lost=64 auto trigger.
    2021/11/02-13:01:37, [C3-1014], 62549, CHASSIS, WARNING, DS_6510B,  Link Reset on Port S0,P30(42) vc_no=0 crd(s)lost=64 auto trigger.
    2021/11/02-13:01:37, [AN-1014], 62550, FID 128, INFO, switchname, Frame timeout detected, tx port 28 rx port 0, sid b2201, did c1c0b, timestamp 2021-11-02 13:01:37 .
    2021/11/02-13:01:37, [AN-1014], 62559, FID 128, INFO, switchname, Frame timeout detected, tx port 30 rx port 34, sid c2201, did c1e08, timestamp 2021-11-02 13:01:37 .
    2021/11/02-13:01:38, [AN-1014], 62560, FID 128, INFO, switchname, Frame timeout detected, tx port 31 rx port 0, sid b2201, did c1f17, timestamp 2021-11-02 13:01:38 .
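
The correlation between frame timeouts and link resets can be checked by counting both message types per port. This is a minimal Python sketch based only on the message layout shown above. It assumes that the number after `P` in a C3-1014 message and the `tx port` value in an AN-1014 message refer to the same port index, which matches the excerpt but is not confirmed by the article. The function name `summarize_errdump` is illustrative.

```python
import re
from collections import Counter

# Patterns derived from the errdump excerpt above.
FRAME_TIMEOUT = re.compile(r"\[AN-1014\].*?tx port (\d+) rx port (\d+)")
LINK_RESET = re.compile(r"\[C3-1014\].*?Link Reset on Port S(\d+),P(\d+)")


def summarize_errdump(lines):
    """Return (frame timeouts per tx port, link resets per port)."""
    timeouts, resets = Counter(), Counter()
    for line in lines:
        match = FRAME_TIMEOUT.search(line)
        if match:
            timeouts[int(match.group(1))] += 1
            continue
        match = LINK_RESET.search(line)
        if match:
            # Assumption: the P<n> value is the port index.
            resets[int(match.group(2))] += 1
    return timeouts, resets


sample = [
    "2021/11/02-13:01:36, [C3-1014], 62547, CHASSIS, WARNING, DS_6510B,  Link Reset on Port S0,P31(45) vc_no=0 crd(s)lost=64 auto trigger.",
    "2021/11/02-13:01:37, [C3-1014], 62548, CHASSIS, WARNING, DS_6510B,  Link Reset on Port S0,P28(40) vc_no=0 crd(s)lost=64 auto trigger.",
    "2021/11/02-13:01:37, [C3-1014], 62549, CHASSIS, WARNING, DS_6510B,  Link Reset on Port S0,P30(42) vc_no=0 crd(s)lost=64 auto trigger.",
    "2021/11/02-13:01:37, [AN-1014], 62550, FID 128, INFO, switchname, Frame timeout detected, tx port 28 rx port 0, sid b2201, did c1c0b, timestamp 2021-11-02 13:01:37 .",
    "2021/11/02-13:01:37, [AN-1014], 62559, FID 128, INFO, switchname, Frame timeout detected, tx port 30 rx port 34, sid c2201, did c1e08, timestamp 2021-11-02 13:01:37 .",
    "2021/11/02-13:01:38, [AN-1014], 62560, FID 128, INFO, switchname, Frame timeout detected, tx port 31 rx port 0, sid b2201, did c1f17, timestamp 2021-11-02 13:01:38 .",
]

timeouts, resets = summarize_errdump(sample)
both = sorted(set(timeouts) & set(resets))
print(both)
```

Ports that appear in both counters are the ones where timeouts and link resets coincide.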

  • The fabric partner shows the same behavior on ports 28 and 30:

    porterrshow        :
              frames     [...]  disc   link   loss   loss   frjt   fbsy  c3timeout    pcs    uncor
           tx     rx     [...]  c3    fail    sync   sig                  tx    rx     err    err
     28:    2.0m 602.4k  [...]  49.5k  24      0      1      0      0     49.3k 104     17      0
     29:   54.7m   4.3m  [...]  11     24      0      0      0      0      0      0      0      0
     30:    3.3m 104.7k  [...]  34.2k  24      0      0      0      0     34.0k  98     49      0
     31:  348.9m  32.9m  [...]   4     24      0      9.5k   0      0      0      0      0      0

 
