Slowness due to port reporting high bus overrun discards in ifstat

Last updated

Oct 24, 2024

Category:: ontap-9

Specialty:: nas


Applies to

ONTAP 9
CIFS/SMB
NFS
LACP

Issue

NAS protocol performance issues
Client-side fio tests on NAS mount points served by an HA pair report inconsistent results:
- The mount on Node01 reports high read/write throughput, while the same test against the mount on Node02 reports low throughput:

[20:42:18] root@client:/test01 # fio fiotest.fio
job1: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
...
job2: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
...
fio-3.19
Starting 16 processes
job1: Laying out IO file (1 file / 5120MiB)
job2: Laying out IO file (1 file / 5120MiB)
....
...
Run status group 0 (all jobs):
   READ: bw=520MiB/s (545MB/s), 520MiB/s-520MiB/s (545MB/s-545MB/s), io=304GiB (327GB), run=600121-600121msec
  WRITE: bw=520MiB/s (546MB/s), 520MiB/s-520MiB/s (546MB/s-546MB/s), io=305GiB (327GB), run=600121-600121msec

[20:54:44] root@client:/test02 # fio fiotest.fio
job1: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
...
job2: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
...
fio-3.19
Starting 16 processes
job1: Laying out IO file (1 file / 5120MiB)
job2: Laying out IO file (1 file / 5120MiB)
.....
....
Run status group 0 (all jobs):
   READ: bw=19.4MiB/s (20.3MB/s), 19.4MiB/s-19.4MiB/s (20.3MB/s-20.3MB/s), io=11.4GiB (12.3GB), run=602978-602978msec
  WRITE: bw=19.5MiB/s (20.5MB/s), 19.5MiB/s-19.5MiB/s (20.5MB/s-20.5MB/s), io=11.5GiB (12.3GB), run=602978-602978msec
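
To put a number on the gap between the two runs above, the summary bandwidth figures can be compared directly. The following is a minimal Python sketch, not part of any NetApp or fio tooling. The helper name `summary_bandwidth` is hypothetical, and the pattern only handles values reported in MiB/s, as in the two summaries shown here.

```python
import re

# Matches fio summary fragments such as "READ: bw=520MiB/s".
# Assumption: bandwidth is reported in MiB/s, as in the runs above.
BW_PATTERN = re.compile(r"(READ|WRITE): bw=([\d.]+)MiB/s")


def summary_bandwidth(fio_output: str) -> dict:
    """Return {"READ": MiB/s, "WRITE": MiB/s} from a fio run-status summary."""
    return {op: float(bw) for op, bw in BW_PATTERN.findall(fio_output)}


# Summary values copied from the two test runs in this article.
node01 = summary_bandwidth(
    "READ: bw=520MiB/s (545MB/s) WRITE: bw=520MiB/s (546MB/s)"
)
node02 = summary_bandwidth(
    "READ: bw=19.4MiB/s (20.3MB/s) WRITE: bw=19.5MiB/s (20.5MB/s)"
)

for op in ("READ", "WRITE"):
    ratio = node01[op] / node02[op]
    print(f"{op}: Node01 mount is about {ratio:.1f}x faster than Node02 mount")
```

With these figures the Node01 mount is roughly 27 times faster in both directions, which points at a problem on one node's data path rather than ordinary run-to-run variance.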

Packet traces show a high number of TCP zero-window advertisements from the ONTAP Node02 port, while the count from Node01 is very low:

[Image: zero window.png]
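
One way to compare the two nodes is to count, per sender, how many packets advertise a receive window of 0. The sketch below assumes the trace has already been exported to (source IP, window size) pairs, for example from Wireshark's `ip.src` and `tcp.window_size_value` fields. The function name and the sample addresses are hypothetical, and a raw window of 0 is only an approximation of what Wireshark flags under `tcp.analysis.zero_window`.

```python
from collections import Counter


def count_zero_windows(rows):
    """Count packets advertising a TCP receive window of 0, grouped by sender.

    rows: iterable of (source_ip, window_size) pairs exported from a trace.
    """
    counts = Counter()
    for src, window in rows:
        if int(window) == 0:
            counts[src] += 1
    return counts


# Hypothetical sample rows; the addresses are documentation placeholders,
# not values taken from the trace in this article.
sample = [
    ("192.0.2.11", 65535),
    ("192.0.2.12", 0),
    ("192.0.2.12", 0),
    ("192.0.2.12", 4096),
    ("192.0.2.11", 0),
]

zero = count_zero_windows(sample)
for src, n in zero.most_common():
    print(f"{src}: {n} zero-window packets")
```

A sender with a much higher count than its HA partner is the one that cannot drain its receive buffers fast enough.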

ifstat output

-- interface e0f (6 days, 16 hours, 29 minutes, 42 seconds) --

On the RECEIVE side, the discards and multicast/broadcast counters increment within a short time:

-- interface e0g (0 hours, 4 minutes, 29 seconds) --
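
To judge whether a counter is climbing quickly, it helps to turn two snapshots into a rate. The sketch below is illustrative only: the counter names and values are hypothetical and are not taken from the ifstat output of this case. The 269-second interval corresponds to the 4 minutes 29 seconds shown in the e0g header, on the assumption that this duration is the time since the counters were last cleared.

```python
def counter_rate(before: dict, after: dict, seconds: float) -> dict:
    """Return the per-minute increase of each counter between two snapshots."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        name: (after[name] - before[name]) * 60.0 / seconds
        for name in before
        if name in after
    }


# Hypothetical snapshots: names and numbers are placeholders for illustration.
before = {"discards": 0, "multi_broadcast": 0}
after = {"discards": 1345, "multi_broadcast": 538}

# 4 minutes 29 seconds, as in the e0g interval header.
rates = counter_rate(before, after, seconds=4 * 60 + 29)
for name, per_minute in rates.items():
    print(f"{name}: {per_minute:.1f} per minute")
```

Comparing the same rate across the ports of both nodes shows whether the discards are specific to one port.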

A single port may transition in and out of the interface group:

net.ifgrp.lacp.link.inactive:error]: ifgrp a0a, port e0g has transitioned to an inactive state. The interface group is in a degraded state.

net.ifgrp.lacp.link.active:notice]: ifgrp a0a, port e0g has transitioned to the active state.

vifmgr: vifmgr.cluscheck.crcerrors:alert]: Port a0a on node node01 is reporting a high number of observed hardware errors, possibly CRC errors.

vifmgr: vifmgr.port.monitor.failed:error]: The "link_flapping" health check for port a0a (node node01) has failed. The port is operating in a degraded state.
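
To see how often a member port is flapping, the LACP transition messages can be tallied per port. This is a minimal sketch that assumes the EMS log has been saved as plain text lines in the format shown above. The regular expression is derived only from those two sample messages, and `count_transitions` is a hypothetical helper, not an ONTAP command.

```python
import re
from collections import Counter

# Built from the two net.ifgrp.lacp.link.* messages shown above.
FLAP = re.compile(
    r"net\.ifgrp\.lacp\.link\.(inactive|active):\w+\]: "
    r"ifgrp (\S+), port (\S+) has transitioned"
)


def count_transitions(lines):
    """Count LACP link transitions per (ifgrp, port, new_state)."""
    counts = Counter()
    for line in lines:
        # findall also copes with several messages joined on one line.
        for state, ifgrp, port in FLAP.findall(line):
            counts[(ifgrp, port, state)] += 1
    return counts


# The two EMS messages quoted in this article.
log = [
    "net.ifgrp.lacp.link.inactive:error]: ifgrp a0a, port e0g has "
    "transitioned to an inactive state. The interface group is in a "
    "degraded state.",
    "net.ifgrp.lacp.link.active:notice]: ifgrp a0a, port e0g has "
    "transitioned to the active state.",
]

transitions = count_transitions(log)
for (ifgrp, port, state), n in sorted(transitions.items()):
    print(f"{ifgrp}/{port} -> {state}: {n}")
```

Repeated inactive/active pairs for the same port over a short period match the "link_flapping" health check failure reported for the interface group.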