CONTAP-647685: Node reboots unexpectedly due to zombie process accumulation exhausting the BSD process limit
Issue
- Repeated calls via the "network ping" CLI or REST API cause zombie processes to accumulate.
- When the count reaches the BSD process limit, user space processes fail to spawn new threads, eventually triggering a node reboot.
- Examples of user space core events and the eventual reboot by the node watchdog process:
Sat Feb 14 16:32:14 +0800 [cluster-02: vifmgr: ucore.panicString:error]: 'vifmgr: Call to pthread_create() failed with error: Cannot allocate memory, raising SIGABRT(6) at RIP 0x80936cd0a (pid 6922, uid 0, timestamp 1771057934)'
Sat Feb 14 17:25:00 +0800 [cluster-02: cphmd: ucore.panicString:error]: 'cphmd: Call to pthread_create() failed with error: Cannot allocate memory, raising SIGABRT(6) at RIP 0x807a74d0a (pid 55070, uid 0, timestamp 1771061101)'
...
Sat Feb 14 17:24:24 +0800 [cluster-02: nodewatchdog: nodewatchdog.node.panic:alert]: Data ONTAP has experienced a serious internal error: Process vifmgr unresponsive for 163 seconds. This might cause the node experiencing the problem to become unresponsive to data access. The node has been panicked to prevent this condition from continuing.
Sat Feb 14 17:24:24 +0800 [cluster-02: nodewatchdog: sk.panic:alert]: Panic String: Process vifmgr unresponsive for 163 seconds in process nodewatchdog on release 9.12.1P18 (C)
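For background on why zombies count against the process limit, the following is a minimal sketch of generic POSIX behavior, not ONTAP internals. It assumes a Unix-like host with Python 3; the function names (spawn_unreaped_children, in_process_table, reap) are hypothetical and chosen for illustration only. A child that has exited keeps its process-table entry until its parent collects the exit status with wait(), so a parent that never reaps leaves one entry behind per child.

```python
import os


def spawn_unreaped_children(count):
    """Fork `count` children that exit at once; the parent does not wait() on them."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately without running parent code
        pids.append(pid)
    return pids


def in_process_table(pid):
    """Return True if `pid` still occupies a process-table slot (running or zombie)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    return True


def reap(pids):
    """Collect each child's exit status so the kernel can free its entry."""
    reaped = 0
    for pid in pids:
        os.waitpid(pid, 0)
        reaped += 1
    return reaped


if __name__ == "__main__":
    children = spawn_unreaped_children(3)
    print("entries held before reaping:", sum(in_process_table(p) for p in children))
    reap(children)
    print("entries held after reaping:", sum(in_process_table(p) for p in children))
```

Until reap() runs, each exited child still holds a slot. Repeated at scale, that is how unreaped children can consume the per-system process limit even though none of them is doing any work.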