csm.ontapNoMemory errors can cause latency or outright hangs
Applies to
- ONTAP 9.10.1, 9.11.1
- Large clusters (more than 20 nodes) of AFF A400 systems
Issue
- The cluster session manager (CSM) might run out of memory, causing latency or hangs in processes that require inter-node communication
- EMS indicates the issue with the following messages:
  - csm.createSessionFailed: CSM failed to create a connection ... transportType = RDMA_RoCEv2
  - csm.ontapNoMemory: (very long message details - not included)
  - csm.ctFallbackActiveOpen: Cluster Session Manager (CSM) could not successfully create the RDMA connections for session "0005face6e3a9d46" even after several retry attempts. CSM will use TCP connections as defaults
    Note: Despite the wording of this message, TCP fallback does not actually occur.
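
To check quickly whether saved EMS output contains the messages listed above, a small script can count occurrences of each event name. The following is a minimal sketch, not a NetApp tool. It assumes the EMS log has already been exported to plain text with one event per line, and the function name `count_csm_events` is hypothetical. Only the three event names come from this article.

```python
from collections import Counter

# Event names taken from the EMS messages listed in this article.
CSM_EVENTS = (
    "csm.createSessionFailed",
    "csm.ontapNoMemory",
    "csm.ctFallbackActiveOpen",
)


def count_csm_events(lines):
    """Count how many lines mention each CSM event name.

    `lines` is any iterable of strings, for example an open text file
    containing exported EMS output (assumed: one event per line).
    """
    counts = Counter({name: 0 for name in CSM_EVENTS})
    for line in lines:
        for name in CSM_EVENTS:
            if name in line:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample input built from the message text quoted above.
    sample = [
        "csm.createSessionFailed: CSM failed to create a connection ... transportType = RDMA_RoCEv2",
        "csm.ontapNoMemory: (details omitted)",
        "csm.createSessionFailed: CSM failed to create a connection ... transportType = RDMA_RoCEv2",
    ]
    for name, count in count_csm_events(sample).items():
        print(f"{name}: {count}")
```

A nonzero count for csm.ontapNoMemory alongside the RDMA connection failures is consistent with the symptoms described in this article. The script only matches event names and does not interpret the message details.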