NFS sessions hang and high latency is reported because of load balancer problems
Applies to
- ONTAP 9
- FabricPool
Issue
- NFS sessions hang on all volumes configured with FabricPool in the cluster.
- The issue starts with one of the nodes in the cluster reporting high latency.
- nblade_execsOverLimit_1 and Nblade.nfsLongRunningOp errors are seen in the EMS logs (see the parsing sketch after this list):
[Node1: kernel: Nblade.nfsLongRunningOp:debug]: Detected a long running network process operation. The client IP address:port is 92.X.X.66:694. The local IP address:port is 10.X.X.82:2049. The protocol requesting the operation is NFS3. The RPC program number for the operation is 100003. The protocol procedure for the operation is Read (6). The disk process UUID is 05238d4dXXXXXXXXXXXXX160cedebc32. The Vserver associated with the operation is XXXX. The UID of the user is 23068. The MSID for the volume is 2161146647. The inode number of the file is 12955.
[Node1: kernel: nblade_execsOverLimit_1:debug]: params: {'clientIpAddress': '10.X.X.58', 'lifIpAddress': '10.X.X.64', 'vserverId': '4', 'execsLimit': '128'}
- If a takeover/giveback of the affected node is attempted, it may panic with the following string:
RPANIC:giveback or arl hung in wafl while doing SENDHOME_DOING_COMMIT in SK process sendhome_hang_detector on release 9.8P19 (C)
- The issue does not reoccur for some time after a takeover/giveback (TO/GB) of the affected node.
- sktrace logs indicate cloud I/O errors:
[5:0] CLOUD_BIN_ERR: cio_error_to_raid_error: Cloud-bin read block 35286487738791 data unavailable cloud io error 9 btid: 8969343 btuuid: cab5f25b-3425-476f-a361-11a69e7db847, seq_num: 1241209
[13:0] CLOUD_BIN_ERR: cio_error_to_raid_error: Cloud-bin read block 35277266573844 data unavailable cloud io error 9 btid: 40852388 btuuid: f23e6bae-2ef0-4168-b611-3e3d87274447, seq_num: 1637591
[13:0] CLOUD_BIN_ERR: cio_error_to_raid_error: Cloud-bin read block 35284330697390 data unavailable cloud io error 9 btid: 40183670 btuuid: d06a9d55-46c3-473c-b06f-0c6091fa3b02, seq_num: 171567
- The storage aggregate object-store show command reports the object store as available:
cluster::> storage aggregate object-store show
Aggregate      Object Store Name Availability
-------------- ----------------- -------------
aggr1          s3_bucket         available
- According to the storage aggregate object-store profiler start command output, PUT operations show 0 failures, while GET operations fail; the 8KB, 32KB, and 256KB GETs fail on every attempt (see the verification sketch after this list):
Object store config name:
s3_bucket
Node name: Node1
Status: Done
Start time: 8/2/2023 15:08:38
Op      Size    Total   Failed  Latency(ms)             Throughput
                                min     max     avg
-------------------------------------------------------------------------------
PUT     4MB     1041    0       91      17799   2891    66.98MB
GET     4KB     77095   270     5       35501   94      4.28MB
GET     8KB     284     284     10003   35502   23920   0B
GET     32KB    297     297     10000   35000   22532   0B
GET     256KB   285     285     9999    33006   22843   0B
5 entries were displayed.
- StorageGRID is configured as the capacity tier.
- No issues are seen on the StorageGRID nodes.
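The Nblade.nfsLongRunningOp messages above identify the affected client, the volume (by MSID), and the inode. The following is a minimal Python sketch, not a NetApp tool, for pulling those fields out of collected EMS text; the sample line mirrors the message format shown above.

import re

# Sample Nblade.nfsLongRunningOp message text (fields as captured in the EMS log above).
sample = (
    "[Node1: kernel: Nblade.nfsLongRunningOp:debug]: Detected a long running "
    "network process operation. The client IP address:port is 92.X.X.66:694. "
    "The local IP address:port is 10.X.X.82:2049. The protocol requesting the "
    "operation is NFS3. The MSID for the volume is 2161146647. "
    "The inode number of the file is 12955."
)

# Extract the client, volume MSID, and inode so affected clients/volumes can be tallied.
pattern = re.compile(
    r"client IP address:port is (?P<client>\S+)\."
    r".*?MSID for the volume is (?P<msid>\d+)\."
    r".*?inode number of the file is (?P<inode>\d+)\."
)

match = pattern.search(sample)
if match:
    print(match.groupdict())
    # {'client': '92.X.X.66:694', 'msid': '2161146647', 'inode': '12955'}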
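Because the profiler output shows PUTs succeeding while GETs fail, an independent read test through the same StorageGRID load balancer path can help separate a load balancer problem from an ONTAP problem. The sketch below uses boto3 against an assumed test bucket; the endpoint URL, bucket name, and credentials are placeholders and must be replaced with site-specific values. Run it from a host that reaches the object store through the same load balancer as the cluster, and do not point it at the production FabricPool bucket.

import boto3
from botocore.config import Config

# Placeholders: the same load balancer endpoint the cluster uses, and a test bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storagegrid-lb.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(connect_timeout=10, read_timeout=35, retries={"max_attempts": 1}),
)

bucket = "probe-test-bucket"

# PUT then GET objects of the same sizes the profiler exercised; GETs failing while
# PUTs succeed points at the load balancer / return path rather than the grid nodes.
for size in (4 * 1024, 8 * 1024, 32 * 1024, 256 * 1024):
    key = f"probe-{size}"
    s3.put_object(Bucket=bucket, Key=key, Body=b"x" * size)
    try:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        print(f"GET {size // 1024}KB: OK, {len(body)} bytes")
    except Exception as exc:
        print(f"GET {size // 1024}KB: FAILED ({exc})")
    finally:
        s3.delete_object(Bucket=bucket, Key=key)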