Skip to main content
NetApp Knowledge Base

Extreme Client latency and/or hang due to shortage of long-term, replay cache buckets

Views:
1,154
Visibility:
Public
Votes:
0
Category:
ontap-9
Specialty:
nas
Last Updated:

Applies to

  • ONTAP 8.3.x and later
  • NFS

Issue

Based on ONTAP statistics, high latency could be observed from OPM, Grafana, Perfstat/PerfArchive in the latency breakdown section.  Depending on the tools that are used to monitor performance, the bulk of the latency is from 'CPU_NETWORK' or 'CLUSTER_INTERCONNECT'.

Several errors/warnings could be found from various logs:

  1. CSM timeouts observed from request blade(Nblade) in a perfstat 'sysctl sysvar.csm', the same output could be collected manually from SystemShell by running 'sysctl sysvar.csm'.

Example output:

SpinNPSessionInt::timeout): this=0xffffff80085e1028, sessionId=(req=cluster_n01:nblade, rsp=cluster_n02:dblade, uniquifier=00053816b2747090): In last 3974071360 ms, 104 of 2168524218 Ops timed out, 2171533701 started, 0 Ops timed out unsent. 4289664640/0/0 Ops await replies, 0 segs sent, 0 await ACKs

  1. CSM Flowcontrol on the receiver node(Dblade) in a perfstat 'sysctl sysvar.csm', the same output could be collected manually from SystemShell by running 'sysctl sysvar.csm'.

Example output:

SpinNPSessionInt::processSessionFlowcontrolQueue): sess = 0xffffff8007bdf028, sessionId = (req=c55f68b8-7cc0-11e4-84e6-098b9834504d, rsp=cluster_n02:dblade, uniquifier=00053816b2747090), iface = 1, delivered REQUEST pkt = 0xffffff05931fa271 to flow control list

  1. Nblade.nfsConnResetAndClose - 'Maximum number of rewind attempts has been exceeded' could be found from EMS logs.

Example Output:

Nblade.nfsConnResetAndClose: Shutting down connection with the client. Vserver ID is xx; network data protocol is NFS; client IP address:port is xx.xx.xx.xx:xxx. local IP address is xx.xx.xx.xx; reason is CSM error - Maximum number of rewind attempts has been exceeded.

  1. High SpinNP latency outliers observed from a perfstat ‘stats spinnp’ section, check and make sure it increments across iterations. The same output could also be collected manually by running 'statistics show -object spinnp -raw' from ClusterShell(diag mode).

Example Output:

spinnp:spinnp:latency_hist.<1s:2577819
spinnp:spinnp:latency_hist.<2s:7878237
spinnp:spinnp:latency_hist.<4s:6262884
spinnp:spinnp:latency_hist.<6s:1629240
spinnp:spinnp:latency_hist.<8s:307280
spinnp:spinnp:latency_hist.<10s:85273
spinnp:spinnp:latency_hist.<20s:145299
spinnp:spinnp:latency_hist.<30s:51447
spinnp:spinnp:latency_hist.<60s:30
spinnp:spinnp:latency_hist.<90s:10
spinnp:spinnp:latency_hist.<120s:6
spinnp:spinnp:latency_hist.>120s:50

  1. Spinhi statistics indicates that almost all the Spinhi requests are in deferred queue, this could be found from a perfstat 'spinhi_stats' section, or manually collect it from nodeshell by running 'spinhi_stats'(diag mode).

Example Output:

(spinhi_stats) size=39502 total_req=421874001827 cur_req=25780 max_req=26702 total_resp=421873962781 total_replay_resp=289138 defer_req=55765 cur_defer=25780 max_defer=25780 hipri=15603269 unmarshal_errs=0 marshal_errs=0 fastpath_null_resps=0 cur_nogrow_filecb_bulk=0, cur_nogrow_filecb_op=0 redo=131995, max_nogrow_filecb_bulk=0 max_nogrow_filecb_fileop=0 Access: count=44862084546 hipri=0 errs=77411717 elapsed: max=14087030.76 avg=280.45

cur_req: Current number of requests in SpinHi
cur_defer: Current number of requests in SpinHi Defer Queue
If cur_defer == cur_req, that means, all the current requests at Spinhi are in the Defer Queue
Counter "spinnp_replay_max_long_term_hit" increments across iterations in a perfstat section 'stats spinnp_replay_cache', for example:
spinnp_replay_cache:spinnp_replay_cache:spinnp_replay_max_long_term_hit:20467472
spinnp_replay_max_long_term_hit: Total number of times max long term limit was hit"

 

Sign in to view the entire content of this KB article.

New to NetApp?

Learn more about our award-winning Support

NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.