Multiple ESXi hosts on Cisco UCS lost datastore connectivity
Applies to
- VMware ESXi
- Cisco UCS-FI-6332-16UP >= 4.0
- Brocade FabricOS (FOS)
Issue
- On ESXi, vobd.log lists LUN reset, path loss, All Paths Down (APD), and Permanent Device Loss (PDL) events against multiple (or all) datastores:
[...]
2021-11-02T12:54:31.826Z: [scsiCorrelator] 1068996446us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba1:C0:T25:L10 changed state from on
2021-11-02T12:54:31.830Z: [scsiCorrelator] 1069000506us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba1:C0:T25:L4 changed state from on
2021-11-02T12:57:17.144Z: [scsiCorrelator] 1234311968us: [vob.scsi.scsipath.por] Power-on Reset occurred on naa.600a09803830376643xxxx
2021-11-02T12:58:53.367Z: [scsiCorrelator] 1330557897us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.600a09803830376643xxxxx. Path vmhba0:C0:T21:L4 is down. Affected datastores: Unknown.
2021-11-02T13:01:13.367Z: [APDCorrelator] 1470530196us: [vob.storage.apd.timeout] Device or filesystem with identifier [naa.600a0980383037654xxxxx] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2021-11-02T13:06:13.119Z: [APDCorrelator] 1770276511us: [vob.storage.apd.start] Device or filesystem with identifier [naa.600a09803830376643xxxx] has entered the All Paths Down state.
2021-11-02T13:06:13.119Z: [APDCorrelator] 1770309617us: [esx.problem.storage.apd.start] Device or filesystem with identifier [naa.600a09803830376643xxxx] has entered the All Paths Down state.
2021-11-02T13:06:13.120Z: [scsiCorrelator] 1770310964us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.600a09803830376643xxxxx. Path vmhba0:C0:T27:L8 is down. Affected datastores: Unknown.
[...]
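To get a quick overview of such an excerpt, the entries can be tallied per event ID and per HBA. The following is a minimal sketch, assuming the vobd.log line layout shown above (timestamp, correlator, uptime in microseconds, event ID, message). The names summarize_vobd, VOB_LINE, SCSI_PATH and STORAGE_PREFIXES are illustrative and not part of any VMware tool.

```python
import re
from collections import Counter

# Assumed layout of a vobd.log entry, taken from the excerpt above:
# <timestamp>: [<correlator>] <uptime>us: [<event id>] <message>
VOB_LINE = re.compile(
    r"^(?P<ts>\S+): \[(?P<correlator>\w+)\] (?P<uptime_us>\d+)us: "
    r"\[(?P<event>[\w.]+)\] (?P<message>.*)$"
)
# Runtime path name such as vmhba1:C0:T25:L10; only the HBA part is captured.
SCSI_PATH = re.compile(r"\b(vmhba\d+):C\d+:T\d+:L\d+")
# Event ID prefixes treated as storage related (an assumption for this sketch).
STORAGE_PREFIXES = ("vob.scsi.", "vob.storage.", "esx.problem.storage.")


def summarize_vobd(lines):
    """Return (events, hbas): counts per event ID and per HBA named in a path."""
    events = Counter()
    hbas = Counter()
    for line in lines:
        match = VOB_LINE.match(line.strip())
        if not match or not match["event"].startswith(STORAGE_PREFIXES):
            continue  # skips "[...]" markers and unrelated entries
        events[match["event"]] += 1
        path = SCSI_PATH.search(match["message"])
        if path:
            hbas[path.group(1)] += 1
    return events, hbas
```

Passing an open file object for vobd.log to summarize_vobd works as well, since the function only iterates over lines. Events on both vmhba0 and vmhba1 at the same time, as in the excerpt, point at a problem that is not limited to a single HBA.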
- On FOS, porterrshow shows a significant increase in the c3timeout error counters compared to the other counters. The c3timeout rx counts are a result of the tx timeouts.
porterrshow:
     frames          [...]  disc   link  loss  loss  frjt  fbsy  c3timeout     pcs  uncor
     tx      rx      [...]  c3     fail  sync  sig               tx     rx     err  err
0:   4.0g    739.0m  [...]  42.3k  0     0     0     0     0     0      42.0k  0    0
1:   327.2m  2.3g    [...]  16.2k  0     0     0     0     0     0      16.1k  0    0
22:  13.6k   25.6k   [...]  58     0     0     0     0     0     0      58     0    0
23:  12.9k   24.8k   [...]  6      0     0     0     0     0     0      6      0    0
28:  189.4m  8.7m    [...]  1.1k   750   0     1     0     0     1.1k   5      0    0
29:  87.8m   6.5m    [...]  1.2k   750   0     1     0     0     1.2k   3      0    0
30:  33.1m   6.0m    [...]  3.0k   750   0     1     0     0     3.0k   1      1    0
31:  98.8m   21.1m   [...]  1.3k   750   0     1     0     0     1.3k   1      1    0
32:  1.0g    1.1g    [...]  1.6k   0     0     0     0     0     0      1.6k   0    0
33:  1.0g    1.1g    [...]  950    0     0     0     0     0     0      947    0    0
34:  530.7m  1.7g    [...]  2.0k   0     0     0     0     0     0      2.0k   0    0
35:  549.5m  2.2g    [...]  1.7k   0     0     0     0     0     0      1.7k   0    0
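The ports with c3 timeouts on the tx side can also be picked out of a saved porterrshow capture with a short script. This is a minimal sketch under two assumptions: the last four columns of each row are c3timeout tx, c3timeout rx, pcs err and uncor err, as in the output above, and the k/m/g suffixes are decimal multipliers. The function names parse_count and ports_with_tx_timeouts are illustrative only.

```python
# Assumption: porterrshow abbreviates counters with decimal suffixes.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_count(text):
    """Convert a counter such as '42.3k' or '947' to an approximate integer."""
    suffix = text[-1].lower()
    if suffix in MULTIPLIERS:
        return int(round(float(text[:-1]) * MULTIPLIERS[suffix]))
    return int(text)


def ports_with_tx_timeouts(lines):
    """Map port number to c3timeout tx count for every port where it is non-zero."""
    result = {}
    for line in lines:
        tokens = line.split()
        # Data rows start with '<port>:'; header rows do not and are skipped.
        if len(tokens) < 5 or not tokens[0].rstrip(":").isdigit():
            continue
        # Counting from the right: uncor err, pcs err, c3timeout rx, c3timeout tx.
        tx_timeouts = parse_count(tokens[-4])
        if tx_timeouts > 0:
            result[int(tokens[0].rstrip(":"))] = tx_timeouts
    return result
```

Reading the columns from the right keeps the sketch independent of the columns elided as [...] in the excerpt. Applied to the rows above, it returns ports 28 through 31.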
- The devices connected to the ports with c3 timeouts (tx) can be identified via nsshow for further investigation of the affected link:
N 0c1c00; 3;pn;pn 0x00000000
SCR: None
IP address: ip
PortSymb: [22] "UCSname-A:fc1/1."
NodeSymb: [15] "UCSname-A"
Fabric Port Name: pn
Permanent Port Name: pn
Device type: Physical Unknown(initiator/target)
Port Index: 28
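When many ports are affected, the lookup from port index to attached device can be scripted against a saved nsshow capture. The following is a minimal sketch, assuming each entry starts with a type and a 6-digit port ID followed by a semicolon and contains the PortSymb and Port Index fields as shown above. The function name devices_by_port_index is illustrative and not an FOS command.

```python
import re

# Assumed start of an nsshow entry: type (N, NL or U), then the 6-digit port ID.
ENTRY_START = re.compile(r"^\s*(?:N|NL|U)\s+([0-9a-fA-F]{6});")
PORT_SYMB = re.compile(r'PortSymb:\s*\[\d+\]\s*"(.*)"')
PORT_INDEX = re.compile(r"Port Index:\s*(\d+)")


def devices_by_port_index(lines):
    """Map port index to (port ID, port symbolic name) for each nsshow entry."""
    devices = {}
    pid = symb = None
    for line in lines:
        start = ENTRY_START.match(line)
        if start:
            # A new entry begins; forget the symbolic name of the previous one.
            pid, symb = start.group(1), None
            continue
        found = PORT_SYMB.search(line)
        if found:
            symb = found.group(1)
            continue
        found = PORT_INDEX.search(line)
        if found and pid is not None:
            devices[int(found.group(1))] = (pid, symb)
    return devices
```

If several devices are logged in on the same port index, for example through NPIV, this sketch keeps only the last entry seen for that index.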
- On FOS, errdump -a shows correlating frame timeouts and link resets:
2021/11/02-13:01:36, [C3-1014], 62547, CHASSIS, WARNING, DS_6510B, Link Reset on Port S0,P31(45) vc_no=0 crd(s)lost=64 auto trigger.
2021/11/02-13:01:37, [C3-1014], 62548, CHASSIS, WARNING, DS_6510B, Link Reset on Port S0,P28(40) vc_no=0 crd(s)lost=64 auto trigger.
2021/11/02-13:01:37, [C3-1014], 62549, CHASSIS, WARNING, DS_6510B, Link Reset on Port S0,P30(42) vc_no=0 crd(s)lost=64 auto trigger.
2021/11/02-13:01:37, [AN-1014], 62550, FID 128, INFO, switchname, Frame timeout detected, tx port 28 rx port 0, sid b2201, did c1c0b, timestamp 2021-11-02 13:01:37 .
2021/11/02-13:01:37, [AN-1014], 62559, FID 128, INFO, switchname, Frame timeout detected, tx port 30 rx port 34, sid c2201, did c1e08, timestamp 2021-11-02 13:01:37 .
2021/11/02-13:01:38, [AN-1014], 62560, FID 128, INFO, switchname, Frame timeout detected, tx port 31 rx port 0, sid b2201, did c1f17, timestamp 2021-11-02 13:01:38 .
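The correlation between frame timeouts and link resets can be checked per port from a saved errdump capture. This is a minimal sketch, assuming the [AN-1014] and [C3-1014] message texts appear as in the excerpt above, and that the number after "P" in the link reset message is the same port number that [AN-1014] reports as the tx port, which holds for this excerpt. The function name correlate_errdump is illustrative.

```python
import re
from collections import Counter

# Message layouts assumed from the errdump excerpt above.
FRAME_TIMEOUT = re.compile(
    r"\[AN-1014\].*?Frame timeout detected, tx port (\d+) rx port (\d+)"
)
LINK_RESET = re.compile(r"\[C3-1014\].*?Link Reset on Port S(\d+),P(\d+)\(")


def correlate_errdump(lines):
    """Return (timeouts, resets, both): per-port counters and ports seen in both."""
    timeouts = Counter()  # frame timeouts keyed by tx port
    resets = Counter()    # link resets keyed by the port number after 'P'
    for line in lines:
        found = FRAME_TIMEOUT.search(line)
        if found:
            timeouts[int(found.group(1))] += 1
            continue
        found = LINK_RESET.search(line)
        if found:
            resets[int(found.group(2))] += 1
    both = sorted(set(timeouts) & set(resets))
    return timeouts, resets, both
```

For the six lines above, the ports that show both a frame timeout and a link reset are 28, 30 and 31.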
- The fabric partner shows the same c3 timeouts (tx) on ports 28 and 30:
porterrshow:
     frames          [...]  disc   link  loss  loss  frjt  fbsy  c3timeout     pcs  uncor
     tx      rx      [...]  c3     fail  sync  sig               tx     rx     err  err
28:  2.0m    602.4k  [...]  49.5k  24    0     1     0     0     49.3k  104    17   0
29:  54.7m   4.3m    [...]  11     24    0     0     0     0     0      0      0    0
30:  3.3m    104.7k  [...]  34.2k  24    0     0     0     0     34.0k  98     49   0
31:  348.9m  32.9m   [...]  4      24    0     9.5k  0     0     0      0      0    0