CONTAP-375336: ONTAP Cluster Node Reboots Unexpectedly While Collecting CM Statistics
Issue
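An ONTAP cluster node reboots unexpectedly while collecting Counter Manager (CM) statistics. The CCMA-Worker process hangs in csm_spin_lock(), called from CtConnection::getCryptRates(), and the node panics with "process on cpu16 hung (CCMA-Worker) for 5005 milliseconds!" and the following stack trace: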
#0 0xffffffff8a1a1f5d in dumpcore (pmsg=0xffffffff97807a90 <kmod_dumper.buf> "process on cpu16 hung (CCMA-Worker) for 5005 milliseconds! in SK process CCMA-Worker on release 9.14.1P5 (C)", force=0) at prod/common/core/core_dumper.c:2065
#1 0xffffffff8a1a771c in kmod_dumper (kdpanic=<optimized out>) at prod/kmod/common/core/kmod_core.c:502
#2 0xffffffff8073e2ec in call_on_new_stack (func=0xfffffe00834c3320, arg1=<optimized out>, new_sp=1) at src/core_stack.c:100
#3 dumpcore_on_dumpstack (dumpcoord=0xffffffff88b27e30 <ontap_dumpcore>, arg1=<optimized out>) at src/core_stack.c:135
#4 0xffffffff8057b7d0 in doadump (textdump=<optimized out>) at ../../../../src/sys/kern/kern_shutdown.c:671
#5 0xffffffff8057b62e in kern_reboot (howto=260) at ../../../../src/sys/kern/kern_shutdown.c:847
#6 0xffffffff8057bf63 in vpanic (fmt=0xffffffff81440f30 <sk_panic.buf> "process on cpu16 hung (CCMA-Worker) for 5005 milliseconds!", ap=0xfffffe00399bbd90) at ../../../../src/sys/kern/kern_shutdown.c:1490
#7 0xffffffff8057b892 in panic (fmt=0xffffffff97805290 <trap_regs> " 3L\203") at ../../../../src/sys/kern/kern_shutdown.c:1269
#8 0xffffffff8057c042 in sk_panic (fmt=<optimized out>) at ../../../../src/sys/kern/kern_shutdown.c:1518
#9 0xffffffff8c188258 in sk_timeout_cpuhog_mycpu () at src/sk_core/sk_cpuhog.c:401
#10 0xffffffff8038011e in statclock (cnt=1, usermode=0) at ../../../../src/sys/kern/kern_clock.c:707
#11 0xffffffff8037ed58 in handleevents (now=23435086424798096, fake=4) at ../../../../src/sys/kern/kern_clocksource.c:207
#12 timercb (et=<optimized out>, arg=<optimized out>) at ../../../../src/sys/kern/kern_clocksource.c:374
#13 0xffffffff803a5132 in lapic_handle_timer (frame=0xfffff6806106c770) at ../../../../src/sys/x86/x86/local_apic.c:1416
#14 0xffffffff80d226c8 in Xtimerint () at ../../../../src/sys/amd64/amd64/apic_vector.S:143
#15 csm_spin_lock (lockp=0x0) at src/Common/CsmSpinlocksFast.cc:79
#16 0xffffffff84375229 in CtConnection::getCryptRates (this=0xfffff801d1c82028, bs=@0x0: 85899345924, br=@0x0: 85899345924, us=@0x0: 85899345924, ur=@0xfffff6806106c8e0: 0) at src/Ct/CtConnection.cc:1470
#17 0xffffffff8438e758 in SpinNPSessionInt::getAvgCryptRates (this=0xfffffe0169eae028, totalByteSent=@0xfffff6806106d190: 0, totalByteReceived=@0xfffff6806106d198: 0, totalUsSent=@0xfffff6806106d178: 0, totalUsReceived=@0xfffff6806106d180: 0) at src/Session/session.cc:7148
#18 0xffffffff8437d837 in cm_obj_csm_session_update (offset_list=<optimized out>, instance_name=<optimized out>, instance_name_buflen=<optimized out>, instance_size=2408, curr_inst_cnt=0xfffff6806106d3d4, instance_data_p=<optimized out>, inst_start_p=0xfffff68062c4cdc8 "0x11062b980d4f2c00", inst_buf_remaining=1004136, _this_obj=<optimized out>, cnt_defs=0xffffffff8453ce00 <csm_session_cnt_defs>, request_type=Cm_Instance_Names_And_Data, priv_flag=<optimized out>, cm_kfilter=0x0, request_origin=<optimized out>, _next_tag_p=0xfffff6806106d3b8) at src/cm/cm_obj_csm_session.cc:131
#19 0xffffffff8c5340fb in cm_query_record (obj_id=1, buffer=0x0, current_instance_cnt_pt=0xfffff6806106d3d4, buffer_size=1047480, instance_name=0x0, instance_size=2408, cnt_ids=0xfffff814b812c3a0, priv_flag=0, query_type=Cm_Instance_Names_And_Data, origin=Cm_Origin_Archiver_Cmode, next_tag_pt=0xfffff6806106d3b8, pc=0) at src/cm_api.c:798
#20 0xffffffff8c9238ba in _kccma_collect_sample (env=0xfffff810e842c000, object=<optimized out>, period=300000, workerID=1, sample_timeout_ms=5456700994, next_tag_pt=<optimized out>) at src/ccma_worker.cc:502
#21 kccma_collector_worker_thread (arg1=-8723476856832, arg2=1) at src/ccma_worker.cc:955
#22 0xffffffff804b7566 in fork_exit (callout=0xffffffff8c923520 <kccma_collector_worker_thread(long, long)>, arg=0xfffff810e842c000, frame=0x1) at ../../../../src/sys/kern/kern_fork.c:1296