Intermittent severe latency spikes occur between E-Series NVMe controller and ESXi hosts
Applies to
- E-Series using NVMe over RoCE (EF300 / EF600)
- VMware ESXi hosts
Issue
- Latency measured via Veeam ONE shows random spikes.
- No internal latency found on server/storage sides.
- Firmware/driver updates on ESXi hosts did not help.
- Nexus switch logs show packet drops, likely due to QoS misconfiguration.
- NVMe controller stats report frequent KeepAliveTimeout (KAT) events.
- E-Series support data log -> NVMEOF-STATISTICS.CSV
"NVMe Controller Statistics (Raw)"
"NVMe Controllers","TCC","KAT","MQCF","MCCF","CR","CS"
"Controller A, HIC 2, port 2a","4","69","0","0","0","0"
"Controller A, HIC 2, port 2b","4","54","0","0","0","0"
"Controller B, HIC 2, port 2a","4","65","0","0","0","0"
"Controller B, HIC 2, port 2b","4","57","0","0","0","0"
"NVMe controller statistics legend"
"TCC = Total Controller Count"
"KAT = Keep Alive Timeouts"
"MQCF = Max Queue Connection Failures"
"MCCF = Max Controller Connection Failures"
"CR = NVMe Controller Resets"
"CS = NVMe Controller Shutdowns"
- No physical layer errors; the issue is at the NVMe protocol layer.
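To check the KAT counters without reading the CSV by eye, the statistics rows can be parsed with a short script. The following is a minimal sketch, not a supported tool. It assumes the header and data rows have already been extracted from NVMEOF-STATISTICS.CSV in the layout shown in this article (a real file may contain other sections). The function name `ports_with_kat` and the `threshold` parameter are hypothetical names chosen for illustration.

```python
import csv
import io

# Sample rows copied from the statistics excerpt in this article.
# Assumption: the header line and data rows have been extracted
# from NVMEOF-STATISTICS.CSV before being passed in.
SAMPLE = '''"NVMe Controllers","TCC","KAT","MQCF","MCCF","CR","CS"
"Controller A, HIC 2, port 2a","4","69","0","0","0","0"
"Controller A, HIC 2, port 2b","4","54","0","0","0","0"
"Controller B, HIC 2, port 2a","4","65","0","0","0","0"
"Controller B, HIC 2, port 2b","4","57","0","0","0","0"
'''


def ports_with_kat(text, threshold=0):
    """Return (port, KAT count) pairs whose KAT count exceeds threshold.

    `threshold` is an illustrative parameter, not a vendor-defined limit.
    """
    reader = csv.DictReader(io.StringIO(text))
    flagged = []
    for row in reader:
        kat = int(row["KAT"])
        if kat > threshold:
            flagged.append((row["NVMe Controllers"], kat))
    return flagged


if __name__ == "__main__":
    for port, kat in ports_with_kat(SAMPLE):
        print(f"{port}: {kat} keep-alive timeouts")
```

With the sample above, all four host-interface ports are flagged, which matches the observation that keep-alive timeouts occur on both controllers rather than on a single port.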
