Slowness due to port reporting high bus overrun discards in ifstat
Applies to
- ONTAP 9
- CIFS/SMB
- NFS
- LACP
Issue
- NAS protocol performance issues
- Client-side FIO tests on NAS mountpoints from an HA pair report inconsistent results:
- The Node01 mount reports high R/W throughput, while the same test on the Node02 mount reports low throughput:
[20:42:18] root@client:/test01 # fio fiotest.fio
job1: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
...
job2: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
...
fio-3.19
Starting 16 processes
job1: Laying out IO file (1 file / 5120MiB)
job2: Laying out IO file (1 file / 5120MiB)
....
...
Run status group 0 (all jobs):
READ: bw=520MiB/s (545MB/s), 520MiB/s-520MiB/s (545MB/s-545MB/s), io=304GiB (327GB), run=600121-600121msec
WRITE: bw=520MiB/s (546MB/s), 520MiB/s-520MiB/s (546MB/s-546MB/s), io=305GiB (327GB), run=600121-600121msec
[20:54:44] root@client:/test02 # fio fiotest.fio
job1: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
...
job2: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
...
fio-3.19
Starting 16 processes
job1: Laying out IO file (1 file / 5120MiB)
job2: Laying out IO file (1 file / 5120MiB)
.....
....
Run status group 0 (all jobs):
READ: bw=19.4MiB/s (20.3MB/s), 19.4MiB/s-19.4MiB/s (20.3MB/s-20.3MB/s), io=11.4GiB (12.3GB), run=602978-602978msec
WRITE: bw=19.5MiB/s (20.5MB/s), 19.5MiB/s-19.5MiB/s (20.5MB/s-20.5MB/s), io=11.5GiB (12.3GB), run=602978-602978msec
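The gap between the two fio runs above can be quantified directly from their summary lines. The following is a minimal sketch, not part of any NetApp or fio tooling: the helper name `fio_bandwidth_mib` is hypothetical, and it assumes the summary keeps the `READ: bw=<value><unit>/s` layout shown in the output above.

```python
import re

# Summary lines copied from the two fio runs above.
NODE01 = """\
READ: bw=520MiB/s (545MB/s), 520MiB/s-520MiB/s (545MB/s-545MB/s), io=304GiB (327GB), run=600121-600121msec
WRITE: bw=520MiB/s (546MB/s), 520MiB/s-520MiB/s (546MB/s-546MB/s), io=305GiB (327GB), run=600121-600121msec
"""
NODE02 = """\
READ: bw=19.4MiB/s (20.3MB/s), 19.4MiB/s-19.4MiB/s (20.3MB/s-20.3MB/s), io=11.4GiB (12.3GB), run=602978-602978msec
WRITE: bw=19.5MiB/s (20.5MB/s), 19.5MiB/s-19.5MiB/s (20.5MB/s-20.5MB/s), io=11.5GiB (12.3GB), run=602978-602978msec
"""

# Matches the aggregate bandwidth at the start of each summary line.
_BW = re.compile(r"^\s*(READ|WRITE):\s*bw=([\d.]+)(KiB|MiB|GiB)/s", re.M)
_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def fio_bandwidth_mib(summary):
    """Hypothetical helper: return {'READ': MiB/s, 'WRITE': MiB/s}."""
    return {op: float(val) * _TO_MIB[unit] for op, val, unit in _BW.findall(summary)}


node01 = fio_bandwidth_mib(NODE01)
node02 = fio_bandwidth_mib(NODE02)
for op in ("READ", "WRITE"):
    ratio = node01[op] / node02[op]
    print(f"{op}: Node01 throughput is {ratio:.1f}x that of Node02")
```

With the values above, the Node01 mount delivers roughly 27 times the throughput of the Node02 mount for both reads and writes.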
- Packet traces reveal a high number of zero windows reported from the ONTAP Node02 port, while the number on Node01 is very low.
- ifstat output shows receive discards that match the bus overruns count:
-- interface e0f (6 days, 16 hours, 29 minutes, 42 seconds) --
RECEIVE
Total frames: 949m | Frames/second: 1642 | Total bytes: 3612g
Bytes/second: 6251k | Total errors: 0 | Errors/minute: 0
Total discards: 127k | Discards/minute: 13 | Multi/broadcast: 134k
Non-primary u/c: 0 | CRC errors: 0 | Runt frames: 0
Long frames: 0 | Length errors: 0 | Alignment errors: 0
No buffer: 0 | Pause: 0 | Jumbo: 0
Noproto: 0 | Bus overruns: 127k | LRO segments: 7122m
LRO bytes: 3543g | LRO6 segments: 0 | LRO6 bytes: 0
Bad UDP cksum: 0 | Bad UDP6 cksum: 0 | Bad TCP cksum: 0
Bad TCP6 cksum: 0 | Mcast v6 solicit: 0
TRANSMIT
Total frames: 755m | Frames/second: 1307 | Total bytes: 3584g
Bytes/second: 6203k | Total errors: 0 | Errors/minute: 0
Total discards: 0 | Queue overflow: 0 | Multi/broadcast: 64689
Pause: 0 | Jumbo: 1506m | Cfg Up to Downs: 0
TSO segments: 469m | TSO bytes: 3341g | TSO6 segments: 0
TSO6 bytes: 0 | HW UDP cksums: 0 | HW UDP6 cksums: 0
HW TCP cksums: 755m | HW TCP6 cksums: 0 | Mcast v6 solicit: 0
DEVICE
Mcast addresses: 3 | Rx MBuf Sz: 4096
LINK INFO
Speed: 10000M | Duplex: full | Flowcontrol: none
Media state: active | Up to downs: 1
- The RECEIVE side increments discards and multi/broadcast traffic within a short time:
-- interface e0g (0 hours, 4 minutes, 29 seconds) --
RECEIVE
Total frames: 3989 | Frames/second: 15 | Total bytes: 612k
Bytes/second: 2278 | Total errors: 0 | Errors/minute: 0
Total discards: 102m | Discards/minute: 22767k | Multi/broadcast: 199m
Non-primary u/c: 0 | CRC errors: 0 | Runt frames: 0
Long frames: 0 | Length errors: 0 | Alignment errors: 0
No buffer: 0 | Pause: 0 | Jumbo: 258k
Noproto: 0 | Bus overruns: 102m | LRO segments: 652m
LRO bytes: 2574 | LRO6 segments: 0 | LRO6 bytes: 0
Bad UDP cksum: 0 | Bad UDP6 cksum: 0 | Bad TCP cksum: 0
Bad TCP6 cksum: 0 | Mcast v6 solicit: 0 | Lagg errors: 0
Lacp errors: 0 | Lacp PDU errors: 0
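In both ifstat outputs above, the Total discards value equals the Bus overruns value. A minimal sketch of that comparison follows. The function names are hypothetical, and the code assumes that the `k`, `m`, and `g` suffixes in ifstat are decimal multipliers applied to rounded display values, so the result is approximate.

```python
import re

# Assumption: ifstat suffixes are decimal multipliers on rounded values.
_SUFFIX = {"k": 10**3, "m": 10**6, "g": 10**9}
_VALUE = re.compile(r"\s*\d+(\.\d+)?[kmg]?\s*")


def parse_count(text):
    """Convert a value such as '127k' or '64689' to an integer."""
    text = text.strip()
    if text[-1] in _SUFFIX:
        return int(float(text[:-1]) * _SUFFIX[text[-1]])
    return int(text)


def parse_ifstat_rows(rows):
    """Parse 'Name: value | Name: value' rows into a dict of counters."""
    counters = {}
    for row in rows:
        for field in row.split("|"):
            name, sep, value = field.partition(":")
            if sep and _VALUE.fullmatch(value):
                counters[name.strip()] = parse_count(value)
    return counters


# RECEIVE rows copied from the e0f and e0g outputs above.
E0F = [
    "Total discards: 127k | Discards/minute: 13 | Multi/broadcast: 134k",
    "Noproto: 0 | Bus overruns: 127k | LRO segments: 7122m",
]
E0G = [
    "Total discards: 102m | Discards/minute: 22767k | Multi/broadcast: 199m",
    "Noproto: 0 | Bus overruns: 102m | LRO segments: 652m",
]

for port, rows in (("e0f", E0F), ("e0g", E0G)):
    rx = parse_ifstat_rows(rows)
    share = rx["Bus overruns"] / rx["Total discards"]
    print(f"{port}: bus overruns account for {share:.0%} of receive discards")
```

For both ports the share comes out at 100%, which is what distinguishes this symptom from discards caused by CRC, length, or no-buffer errors (all zero in the outputs above).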
- A single port may be transitioning in and out of the interface group:
net.ifgrp.lacp.link.inactive:error]: ifgrp a0a, port e0g has transitioned to an inactive state. The interface group is in a degraded state.
net.ifgrp.lacp.link.active:notice]: ifgrp a0a, port e0g has transitioned to the active state.
vifmgr: vifmgr.cluscheck.crcerrors:alert]: Port a0a on node node01 is reporting a high number of observed hardware errors, possibly CRC errors.
vifmgr: vifmgr.port.monitor.failed:error]: The "link_flapping" health check for port a0a (node node01) has failed. The port is operating in a degraded state.
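To see how often a member port leaves and rejoins the interface group, the LACP transition events above can be tallied per port. This is an illustrative sketch only: the function name `count_lacp_transitions` is hypothetical, and the pattern assumes the event text matches the two `net.ifgrp.lacp.link.*` messages exactly as quoted above.

```python
import re
from collections import Counter

# The two LACP events quoted above, copied as-is.
EMS_LINES = [
    "net.ifgrp.lacp.link.inactive:error]: ifgrp a0a, port e0g has transitioned "
    "to an inactive state. The interface group is in a degraded state.",
    "net.ifgrp.lacp.link.active:notice]: ifgrp a0a, port e0g has transitioned "
    "to the active state.",
]

# Captures the new state, the interface group, and the member port.
_LACP = re.compile(
    r"net\.ifgrp\.lacp\.link\.(inactive|active):\w+\]: "
    r"ifgrp ([^,\s]+), port (\S+) has transitioned"
)


def count_lacp_transitions(lines):
    """Hypothetical helper: count events per (ifgrp, port, state)."""
    counts = Counter()
    for line in lines:
        match = _LACP.search(line)
        if match:
            state, ifgrp, port = match.groups()
            counts[(ifgrp, port, state)] += 1
    return counts


counts = count_lacp_transitions(EMS_LINES)
for (ifgrp, port, state), n in sorted(counts.items()):
    print(f"{ifgrp}/{port}: {n} transition(s) to {state}")
```

Run against a longer event log, repeated inactive/active pairs for the same port would indicate the flapping that the "link_flapping" health check reports.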