CONTAP-64446: FlexGroup's excessive use of CSM agent threads might cause a delay in delivering packets for local connections
Issue
- Configuration includes FlexGroups.
- Volume move jobs are running.
- CSM connection timeout errors are logged:
Fri Jan 19 XX:XX:XX -0X00 [XXXXX: CsmMpAgentThread: csm.createSessionFailed:debug]: Cluster Session Manager (CSM) failed to create session (req=XXXXX:dblade, rsp=scc111n09a:dblade, uniquifier=11060f4e7cae0ff5) with transport type NULL, session tag WAFL_REMOTE, record state ACTIVE, CSM error CSM_CONNABORTED, low-level error UNKNOWN, socket error 0, and TLS error 0.
- There is an unexpected takeover due to a software panic:
Fri Jan 19 XX:XX:XX -0X00 [XXXXX: nodewatchdog: nodewatchdog.monitor.history:debug]: mgwd null[mgwd] S0 0,5? -31,5? -61,5? -91,5? -121,5? -151,5? -211,0 -241,0 -270,0 -301,0 -331,0 -360,0 -391,0 -421,0 -451,0 -481,0 -511,0 -541,0 -571,0 -600,0
Fri Jan 19 XX:XX:XX -0X00 [XXXXX: nodewatchdog: nodewatchdog.node.panic:alert]: Data ONTAP has experienced a serious internal error: Process mgwd unresponsive for 225 seconds (mgwd startup: "(2357)"). This might cause the node experiencing the problem to become unresponsive to data access. The node has been panicked to prevent this condition from continuing.
Fri Jan 19 XX:XX:XX -0X00 [XXXXX: send_boot_msg_thread: mgr.stack.string:notice]: Panic string: Process mgwd unresponsive for 225 seconds (mgwd startup: "(2357)") in process nodewatchdog on release 9.10.1P12 (C)
- After the panic, giveback is successful and the node is healthy.
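As a rough triage aid, the log signatures above can be checked for in an exported EMS log. The following is a minimal sketch, not a NetApp tool: the function names `count_symptoms` and `matches_symptoms` are hypothetical, and the header pattern is assumed from the excerpts shown in this article (`[node: process: event.name:severity]`). A match only indicates that both log signatures are present; it does not confirm that FlexGroups are configured or that volume move jobs were running.

```python
import re
from collections import Counter

# Event names copied from the log excerpts in this article.
EVENTS = (
    "csm.createSessionFailed",
    "nodewatchdog.node.panic",
)

# Assumed header layout, based on the excerpts:
# "[<node>: <process>: <event.name>:<severity>]"
HEADER = re.compile(
    r"\[[^\]:]*:\s*[^\]:]*:\s*(?P<event>[\w.]+):(?P<severity>\w+)\]"
)


def count_symptoms(lines):
    """Count how often each event of interest appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = HEADER.search(line)
        if match and match.group("event") in EVENTS:
            counts[match.group("event")] += 1
    return counts


def matches_symptoms(lines):
    """Return True only if every event of interest appears at least once."""
    counts = count_symptoms(lines)
    return all(counts[event] > 0 for event in EVENTS)


if __name__ == "__main__":
    sample = [
        "Fri Jan 19 XX:XX:XX -0X00 [XXXXX: CsmMpAgentThread: "
        "csm.createSessionFailed:debug]: Cluster Session Manager (CSM) "
        "failed to create session",
        "Fri Jan 19 XX:XX:XX -0X00 [XXXXX: nodewatchdog: "
        "nodewatchdog.node.panic:alert]: Data ONTAP has experienced a "
        "serious internal error",
    ]
    print(dict(count_symptoms(sample)))
    print(matches_symptoms(sample))
```

To use it on a real file, pass an open file handle (for example `matches_symptoms(open(path))`, where `path` points to a log export you already have), since the functions accept any iterable of lines.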