CONTAP-497562: Unexpected reboot when a large number of cluster sessions are rapidly created and destroyed
Issue
A rare race condition in cluster connection handling can cause an unexpected reboot when a delayed response for an already-closed connection is matched to a new session that reuses the same identifier. The issue is more likely during rapid session creation and teardown, such as after a network outage.
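The race described above can be illustrated with a minimal C++ sketch. It is not ONTAP source code: SessionTable, Session, Response, handleResponse, and the generation counter are hypothetical names introduced only for illustration. The sketch assumes each request carries a generation number in addition to the session identifier, so a late response can be told apart from traffic that belongs to a newer session reusing the same identifier.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical types for illustration only; not ONTAP's actual data structures.
struct Session {
    std::uint64_t generation;  // unique per open(), even when the identifier is reused
    bool interCluster;
};

struct Response {
    std::uint32_t sessionId;   // identifier the request was sent under
    std::uint64_t generation;  // generation of the session that sent the request
};

class SessionTable {
public:
    // Create a session under `id` and return the generation its requests should carry.
    std::uint64_t open(std::uint32_t id, bool interCluster) {
        std::uint64_t gen = ++nextGeneration_;
        sessions_[id] = Session{gen, interCluster};
        return gen;
    }

    // Tear down the session; the identifier becomes available for reuse.
    void close(std::uint32_t id) { sessions_.erase(id); }

    // Return the session only if the response belongs to the currently live session.
    const Session* match(const Response& rsp) const {
        auto it = sessions_.find(rsp.sessionId);
        if (it == sessions_.end()) return nullptr;                    // session already closed
        if (it->second.generation != rsp.generation) return nullptr;  // identifier was reused
        return &it->second;
    }

private:
    std::unordered_map<std::uint32_t, Session> sessions_;
    std::uint64_t nextGeneration_ = 0;
};

// Returns true if the response was applied, false if it was dropped as stale.
bool handleResponse(const SessionTable& table, const Response& rsp) {
    const Session* s = table.match(rsp);
    if (s == nullptr) return false;  // never dereference an unmatched session
    (void)s->interCluster;           // safe to read: the pointer was validated above
    return true;
}
```

In this sketch, a response issued under identifier 7 before close(7) is rejected both while the identifier is unused and after a later open(7), because the stored generation no longer matches. Matching on the identifier alone would hand the late response to the new session.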
Example Backtrace:
#0 0xffffffff8a7fd701 in dumpcore (pmsg=0xffffffff93f0abc0 <kmod_dumper.buf> "page fault (supervisor read data, page not present) on VA 0x57649 cs:rip 0x20:0xffffffff844491b3 rflags 0x10206 in process CSM_BTLS on release 9.15.1P2 (C)", force=0) at prod/common/core/core_dumper.c:2066
#1 0xffffffff8a807f98 in kmod_dumper (kdpanic=<optimized out>) at prod/kmod/common/core/kmod_core.c:502
#2 0xffffffff80a0fddc in call_on_new_stack (func=0xfffffe01d1522c00, arg1=<optimized out>, new_sp=4) at src/core_stack.c:100
#3 dumpcore_on_dumpstack (dumpcoord=0xffffffff88f69290 <ontap_dumpcore>, arg1=<optimized out>) at src/core_stack.c:135
#4 0xffffffff804d5590 in doadump (textdump=<optimized out>) at ../../../../src/sys/kern/kern_shutdown.c:677
#5 0xffffffff804d53ee in kern_reboot (howto=260) at ../../../../src/sys/kern/kern_shutdown.c:854
#6 0xffffffff804d5e29 in vpanic (fmt=0xffffffff80dfa62d "page fault (%s %s %s, %s) on VA %#lx cs:rip %#lx:%#lx rflags %#lx", ap=0xfffffe0370a5ccd0) at ../../../../src/sys/kern/kern_shutdown.c:1513
#7 0xffffffff804d5652 in panic (fmt=0xffffffff93f02fd0 <trap_regs> "") at ../../../../src/sys/kern/kern_shutdown.c:1276
#8 0xffffffff80be4340 in trap_fatal (frame=0xfffffe0370a5ce90, eva=357961) at ../../../../src/sys/amd64/amd64/trap.c:1159
#9 0xffffffff804a73fd in trap (frame=<optimized out>) at ../../../../src/sys/amd64/amd64/trap.c:398
#10 0xffffffff80c6daef in <signal handler called> () at ../../../../src/sys/amd64/amd64/exception.S:317
#11 SpinNPSessionInt::isInterCluster (this=0x0) at src/include/Session/session.h:1469
#12 CsmActiveOpenUser::activeOpenRsp (this=0xfffff80c20a1a4a8, errorCode=CSM_EXISTS, connection=0xfffff81aaaafa828) at src/Csm/CsmActiveOpenUser.cc:349
#13 0xffffffff84465fb4 in CtActiveOpen::activeOpenRsp (this=0xfffff80938751828, errorCode=CSM_EXISTS) at src/Ct/CtConnection.cc:364
#14 0xffffffff84469ac6 in CtState::triggerActiveOpenErrorRsp (conn=<optimized out>, error=CSM_EXISTS, this=<optimized out>) at src/Ct/CtState.cc:151
#15 CtBindSent::rcv (this=<optimized out>, conn=<optimized out>, gbuf=<optimized out>, ctHdr=..., recordLen=20, gItemLen=<optimized out>, err=0xfffffe0370a5d1e4) at src/Ct/CtState.cc:336
#16 0xffffffff847166e3 in non-virtual thunk to CtConnection::loRcv(GbufTable_t*, GbufTable_t*) () at src/Ct/CtConnection.cc:1099
#17 0xffffffff8447a3f9 in CtLoSocket::Connection::receiveMbufs (this=0xfffff80f512e0828, m=0xfffff80c24281e00) at src/CtLo/CtLoSocket.cc:3611
#18 0xffffffff8446edc8 in CtLoBtlsFilter::recvProcB (this=0xfffff8089a10baa8, bytes=20) at src/CtLo/CtLoBtlsFilter.cc:212
#19 0xffffffff844deaae in BTLS::handleUpcalls (this=<optimized out>, upcalls=10) at src/BTLS.cc:351
#20 0xffffffff844e38f8 in TLS::ThreadPool::threadBody (this=0xfffff80245117100, me=0xfffff80297ea3300) at src/TLS.cc:558
#21 0xffffffff84714ac0 in util::Thread::_body (p=0xfffff80297ea3300) at common_util/Thread.cc:181
#22 0xffffffff805a8486 in fork_exit (callout=0xffffffff84714ab0 <util::Thread::_body(void*)>, arg=0xfffff80297ea3300, frame=0xfffffe0370a5d480) at ../../../../src/sys/kern/kern_fork.c:1315