CONTAP-484943: Some NFS ops against qtrees taking 5 minutes on 9.16.1
Issue
- On systems running, or upgraded to, ONTAP 9.16.1, occasional NFS operations against qtrees take up to 5 minutes to receive a response.
- Linux clients using the default timeo of 600 (60 seconds) and the default retrans of 2 would likely log an "NFS server not responding" error message after 3 minutes, followed by "NFS server ok" 2 minutes later.
- Applications such as IBM MQ, which require faster response times, may be impacted.
- To confirm the issue from the ONTAP side:
- First, check for the hourly EMS message indicating that one or more NFS operations took longer than 60 seconds:
- Nblade.NfsResponseTraceTriggerHourly:debug]: params: {'responseCount': '14', 'trigger': '60'}
- If operations taking longer than 60 seconds are reported, enable NFS server tracing:
- set diag; nfs server modify -vserver * -trace-enabled true
- Look for EMS events showing an NFS processing time (procTime) close to 300 seconds:
- Nblade.NfsResponseTraceTrigger:debug]: params: {'clientAddr': '10.1.1.2', 'op': 'NFSv4 COMPOUND', 'vserverId': '#', 'procTime': '297', 'trigger': '60'}
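
When many EMS events have been collected, the check above can be scripted. The following is only a minimal sketch: it assumes the events have been saved as text lines in the same form as the samples in this article, and the function names and the 240-second cutoff for "close to 300 seconds" are hypothetical choices, not part of ONTAP.

```python
import ast
import re

# Captures the event name and the params dictionary from an EMS line shaped
# like the samples in this article. A stray backslash before "{" is tolerated.
EVENT_RE = re.compile(
    r"(Nblade\.NfsResponseTraceTrigger\w*):debug\]: params: \\?(\{.*\})"
)

def parse_event(line):
    """Return (event_name, params_dict), or None if the line does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    name, raw_params = match.groups()
    return name, ast.literal_eval(raw_params)

def slow_ops(lines, threshold_s=240):
    """Return params of per-operation events with procTime >= threshold_s.

    threshold_s is a hypothetical cutoff for "close to 300 seconds".
    Hourly summary events carry no procTime and are skipped.
    """
    hits = []
    for line in lines:
        parsed = parse_event(line)
        if parsed is None:
            continue
        name, params = parsed
        if name != "Nblade.NfsResponseTraceTrigger":
            continue
        if int(params["procTime"]) >= threshold_s:
            hits.append(params)
    return hits

# The two sample events quoted in this article.
SAMPLE = [
    "Nblade.NfsResponseTraceTriggerHourly:debug]: params: "
    "{'responseCount': '14', 'trigger': '60'}",
    "Nblade.NfsResponseTraceTrigger:debug]: params: "
    "{'clientAddr': '10.1.1.2', 'op': 'NFSv4 COMPOUND', 'vserverId': '#', "
    "'procTime': '297', 'trigger': '60'}",
]

for event in slow_ops(SAMPLE):
    print(event["clientAddr"], event["op"], event["procTime"])
```

Run against the two sample lines, only the per-operation event with a procTime of 297 is reported; the hourly summary is ignored because it only counts slow responses.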
Note:
- To be exposed to this issue, systems must be running ONTAP 9.16.1 (without the fix or workaround for this issue deployed) and must be using qtree exports over NFS.
- This issue is more likely to be seen on higher-end systems with high CPU counts due to the increased concurrency these systems allow.
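
The client-side timeline described under Issue (an error after 3 minutes, recovery 2 minutes later) follows from simple arithmetic on the mount options. The sketch below is illustrative only: it assumes each attempt waits the same timeo and ignores any backoff the client may apply between retransmissions, and the function names are hypothetical.

```python
def not_responding_after_s(timeo_ds=600, retrans=2):
    """Seconds until the client reports "not responding".

    Assumes the original request and each of the `retrans` retransmissions
    wait the same timeo; client backoff is ignored.
    """
    timeo_s = timeo_ds / 10  # timeo is expressed in tenths of a second
    return timeo_s * (retrans + 1)

def recovery_gap_s(response_time_s=300, timeo_ds=600, retrans=2):
    """Seconds between the "not responding" message and the eventual reply."""
    return response_time_s - not_responding_after_s(timeo_ds, retrans)

print(not_responding_after_s() / 60)  # 3.0 minutes
print(recovery_gap_s() / 60)          # 2.0 minutes
```

With the defaults, three 60-second waits elapse before the error is logged, and a reply arriving at roughly 300 seconds leaves a gap of about 2 minutes before the client reports the server as available again.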
