Node Offline due to failing NVRAM card with 'hung_task' and 'hung_task_timeout_secs'
Applies to
SolidFire AFA: SF19210
Issue
- The node went offline; immediately prior to the outage, sf-master.info shows the following:
2023-04-29T18:44:47.632229Z SFALPSF08 master-1[26751]: [APP-5] [Leader] 28567 CMIscsiConnectMo serviceshared/LeaderCoordinator.cpp:618:OnClusterMasterConnectCallback|Full vote, based on connection states shouldVote=1 stateVote=1 sequenceNumber=143 nodesWithWorkingEAContainers={57,72,86,126,154,155,185,199}
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ... (a long run of null bytes, consistent with the log file being truncated abruptly when the node went down)
- dmesg -T shows "hung_task_timeout_secs" messages for nvme0n1:
crash> dmesg -T
[Sat Apr 29 18:49:04 UTC 2023] INFO: task jbd2/nvme0n1-8:26613 blocked for more than 120 seconds.
[Sat Apr 29 18:49:04 UTC 2023] Tainted: G O 4.19.37-solidfire8 #1
[Sat Apr 29 18:49:04 UTC 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
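When triaging similar nodes, the same hung_task signature can be pulled out of saved kernel log output with a simple grep. A minimal sketch follows; the sample file and its contents are hypothetical (the log line is copied from this case), and on a live node you would feed it `dmesg -T` output instead:

```shell
# Hypothetical sample: a saved copy of dmesg -T output containing the
# hung_task message from this case.
cat > /tmp/dmesg_sample.txt <<'EOF'
[Sat Apr 29 18:49:04 UTC 2023] INFO: task jbd2/nvme0n1-8:26613 blocked for more than 120 seconds.
[Sat Apr 29 18:49:04 UTC 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
EOF

# Flag any task blocked past the hung_task timeout; the task name shows
# which device-backed thread (here jbd2 on nvme0n1) is affected.
grep -E 'blocked for more than [0-9]+ seconds' /tmp/dmesg_sample.txt
```

The task name in the match (jbd2/nvme0n1-8) is what ties the hang to the NVRAM card's NVMe device.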
- The following core dumps were generated after the crash:
-rw-rw-rw- 1 dexterap engr 76763717096 Apr 29 12:20 dump.202304291845
-rw-rw-rw- 1 dexterap engr 776107259 Apr 29 12:32 dump.202304291928
- The core files show a kernel panic triggered by hung tasks on the NVRAM card "nvme0n1":
KERNEL: /sf_debug/12.3.2.3/lib64/modules/4.19.37-solidfire8/vmlinux-ember-x86_64-4.19.37-solidfire8
DUMPFILE: dump.202304291845 [PARTIAL DUMP]
CPUS: 56
DATE: Sat Apr 29 18:45:09 UTC 2023
UPTIME: 380 days, 21:16:56
LOAD AVERAGE: 3.68, 3.95, 4.22
TASKS: 3273
NODENAME: QALPOGSF08
RELEASE: 4.19.37-solidfire8
VERSION: #1 SMP Mon Aug 17 14:34:57 UTC 2020
MACHINE: x86_64 (2600 Mhz)
MEMORY: 383.9 GB
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
PID: 299
COMMAND: "khungtaskd"
TASK: ffff8f9c77b71d80 [THREAD_INFO: ffff8f9c77b71d80]
CPU: 22
STATE: TASK_RUNNING (PANIC)
[32908851.679379] INFO: task jbd2/nvme0n1-8:26613 blocked for more than 120 seconds.
[32908852.259911] Kernel panic - not syncing: hung_task: blocked tasks
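The panic summary above comes from opening the dump with the crash utility. A typical session for this kind of analysis might look like the following sketch (the vmlinux and dump paths are the ones from this case; the commands are standard crash commands, and the PID in the `bt` example is the blocked jbd2 thread from the log):

```
crash /sf_debug/12.3.2.3/lib64/modules/4.19.37-solidfire8/vmlinux-ember-x86_64-4.19.37-solidfire8 dump.202304291845
crash> sys                                # system summary (the header block shown above)
crash> log | grep "blocked for more than" # hung-task messages in the kernel ring buffer
crash> ps | grep UN                       # tasks stuck in uninterruptible (D) state
crash> bt 26613                           # stack trace of the blocked jbd2/nvme0n1-8 thread
```

Seeing the jbd2 journaling thread for nvme0n1 stuck in uninterruptible sleep is what points the root cause at the NVRAM card rather than at software.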