CONTAP-85180: CSM sessions in an A400 cluster with certain NICs might be re-established periodically
Issue
- CIFS or NFS latency or lock errors after upgrading to ONTAP 9.11.1
- Users experience slowness and cluster latency is high
- Applications suddenly have higher I/O wait times or other issues
- Applications such as Citrix may throw errors about being unable to open a file because it is locked when accessed from an indirect LIF:
Error writing file \\CIRIX-SHARE\PATH\FILE.TXT [error=The process cannot access the file because another process has locked a portion of the file.]
- Packet traces show STATUS_LOCK_NOT_GRANTED and STATUS_FILE_CLOSED failures:
4013 2023-02-03 03:06:57.921258 170.xx.xx.21 10.xx.xx.71 SMB2 11734 13 Write Request Len:65536 Off:0 File: Folder\file.txt
4966 2023-02-03 03:07:22.280241 10.xx.xx.71 170.xx.xx.21 SMB2 131 13 Write Response, Error: STATUS_LOCK_NOT_GRANTED
5114 2023-02-03 03:07:23.310735 170.xx.xx.21 10.xx.xx.71 SMB2 146 13 Close Request File: Folder\file.txt
5115 2023-02-03 03:07:23.310758 10.xx.xx.71 170.xx.xx.21 SMB2 131 13 Close Response, Error: STATUS_FILE_CLOSED
- EMS logs show CSM RDMA connection failures:
[node-12: kernel: csm.connectionFailed:debug]: CSM failed to create a connection: localBladeUUID = node-12:nblade, remoteBladeUUID = node-11:dblade, uniquifier = 0005f3c99d66505e, transportType = CT, sessionTag = DEFAULT, localVifId = 910, remoteVifIP = 169.254.21.130, CsmError = CSM_FAIL, ctLoError = CTLOPCP_ERR_UNKNOWN, socketError = 0, and TLSerror = 0.
[node-12: kernel: csm.sessionFailed:debug]: Cluster interconnect session (req=node-12:nblade, rsp=node-10:dblade, uniquifier=0005f3c99d7dcfaf) failed with transport type CT, session tag DEFAULT, record state ACTIVE, CSM error CSM_CONNABORTED, low-level error CTLOPCP_ERR_SOCK_CANNOTRCV, socket error -1, and TLS error 0.
[node-12: kernel: csm.ctFallbackActiveOpen:notice]: Cluster Session Manager (CSM) could not successfully create the RDMA connections for session "0005f3c9ae22d35b" even after several retry attempts. CSM will use TCP connections as defaults.
[node-12: kernel: csm.connectionFailed:debug]: CSM failed to create a connection: localBladeUUID = node-12:nblade, remoteBladeUUID = node-11:dblade, uniquifier = 0005f3c9ae22d35b, transportType = RDMA_RoCEv2, sessionTag = DEFAULT, localVifId = 911, remoteVifIP = 169.254.21.131, CsmError = CSM_FAIL, ctLoError = CTLOPCP_ERR_UNKNOWN, socketError = 0, and TLSerror = 0.
[node-12: kernel: csm.createSessionFailed:debug]: Cluster Session Manager (CSM) failed to create session (req=node-12:nblade, rsp=node-10:dblade,
[node-12: kernel: Nblade.cifsLockStateMismatch:debug]: params: {'pMessage': 'Leaked share lock?'}
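When checking whether a cluster matches these symptoms, it can help to tally the CSM-related events per node. The following is a minimal Python sketch, not a NetApp tool. It assumes the EMS output has been saved as plain text and that each entry carries the `[node: process: event:severity]` header seen in the excerpts above. The function name `count_csm_events`, the regular expression, and the abbreviated sample text are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed header layout, based only on the EMS excerpts in this article:
#   [node-12: kernel: csm.connectionFailed:debug]
EMS_HEADER = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<proc>[^:\]]+): (?P<event>[\w.]+):(?P<severity>\w+)\]"
)

# Event names copied from the excerpts above.
EVENTS_OF_INTEREST = {
    "csm.connectionFailed",
    "csm.sessionFailed",
    "csm.ctFallbackActiveOpen",
    "csm.createSessionFailed",
    "Nblade.cifsLockStateMismatch",
}


def count_csm_events(log_text):
    """Return a Counter keyed by (node, event) for the events of interest."""
    counts = Counter()
    for match in EMS_HEADER.finditer(log_text):
        event = match.group("event")
        if event in EVENTS_OF_INTEREST:
            counts[(match.group("node"), event)] += 1
    return counts


# Abbreviated sample: headers are from the article, message bodies are cut short.
SAMPLE_LOG = """\
[node-12: kernel: csm.connectionFailed:debug]: CSM failed to create a connection
[node-12: kernel: csm.sessionFailed:debug]: Cluster interconnect session
[node-12: kernel: csm.ctFallbackActiveOpen:notice]: Cluster Session Manager (CSM)
[node-12: kernel: csm.connectionFailed:debug]: CSM failed to create a connection
[node-12: kernel: Nblade.cifsLockStateMismatch:debug]: params
"""

if __name__ == "__main__":
    for (node, event), n in sorted(count_csm_events(SAMPLE_LOG).items()):
        print(f"{node} {event} {n}")
```

Repeated `csm.connectionFailed` or `csm.sessionFailed` entries alongside `csm.ctFallbackActiveOpen` on the same node would be consistent with the periodic session re-establishment described in the title, although the counts alone do not confirm the bug.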