NetApp Knowledge Base

Node Offline due to failing NVRAM card with 'hung_task' and 'hung_task_timeout_secs'

Category:
element-software
Specialty:
solidfire

Applies to

SolidFire AFA: SF19210

Issue

  • The node went offline. Immediately before that, sf-master.info shows the following:

2023-04-29T18:44:47.632229Z SFALPSF08 master-1[26751]: [APP-5] [Leader] 28567 CMIscsiConnectMo serviceshared/LeaderCoordinator.cpp:618:OnClusterMasterConnectCallback|Full vote, based on connection states shouldVote=1 stateVote=1 sequenceNumber=143 nodesWithWorkingEAContainers={57,72,86,126,154,155,185,199}
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

  • dmesg -T shows a hung-task warning (referencing "hung_task_timeout_secs") for the jbd2 journal thread on nvme0n1:
crash> dmesg -T
[Sat Apr 29 18:49:04 UTC 2023] INFO: task jbd2/nvme0n1-8:26613 blocked for more than 120 seconds.
[Sat Apr 29 18:49:04 UTC 2023]       Tainted: G           O      4.19.37-solidfire8 #1
[Sat Apr 29 18:49:04 UTC 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  • Multiple core dumps were generated after the crash:

   -rw-rw-rw- 1 dexterap engr 76763717096 Apr 29 12:20 dump.202304291845
   -rw-rw-rw- 1 dexterap engr   776107259 Apr 29 12:32 dump.202304291928

  • The core files show kernel panics caused by hung tasks on the NVRAM card "nvme0n1":
      KERNEL: /sf_debug/12.3.2.3/lib64/modules/4.19.37-solidfire8/vmlinux-ember-x86_64-4.19.37-solidfire8
    DUMPFILE: dump.202304291845  [PARTIAL DUMP]
        CPUS: 56
        DATE: Sat Apr 29 18:45:09 UTC 2023
      UPTIME: 380 days, 21:16:56
LOAD AVERAGE: 3.68, 3.95, 4.22
       TASKS: 3273
    NODENAME: QALPOGSF08
     RELEASE: 4.19.37-solidfire8
     VERSION: #1 SMP Mon Aug 17 14:34:57 UTC 2020
     MACHINE: x86_64  (2600 Mhz)
      MEMORY: 383.9 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
         PID: 299
     COMMAND: "khungtaskd"
        TASK: ffff8f9c77b71d80  [THREAD_INFO: ffff8f9c77b71d80]
         CPU: 22
       STATE: TASK_RUNNING (PANIC)

[32908851.679379] INFO: task jbd2/nvme0n1-8:26613 blocked for more than 120 seconds.
[32908852.259911] Kernel panic - not syncing: hung_task: blocked tasks
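
The signature above can be checked in a saved dmesg capture with a short script. The following is a minimal sketch, not a NetApp tool: the function name summarize_hung_tasks and the regular expressions are assumptions made for illustration, based only on the two message formats shown in this article. It extracts the blocked task, its PID, the timeout, and (for jbd2 journal threads, which are named "jbd2/<device>-<n>") the device, along with any kernel panic reason.

```python
import re

# Hung-task warning as printed by the kernel; works with either the
# "[Sat Apr 29 18:49:04 UTC 2023]" or the "[32908851.679379]" prefix.
HUNG_RE = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>.+)")


def summarize_hung_tasks(lines):
    """Return (hung_tasks, panic_reasons) found in an iterable of dmesg lines.

    Hypothetical helper for illustration; task names containing spaces
    are not handled.
    """
    hung, panics = [], []
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            task = m.group("task")
            # For jbd2 journal threads, pull the device out of the task name.
            dev = re.match(r"jbd2/(.+)-\d+$", task)
            hung.append({
                "task": task,
                "pid": int(m.group("pid")),
                "seconds": int(m.group("secs")),
                "device": dev.group(1) if dev else None,
            })
            continue
        m = PANIC_RE.search(line)
        if m:
            panics.append(m.group("reason").strip())
    return hung, panics


sample = [
    "[32908851.679379] INFO: task jbd2/nvme0n1-8:26613 blocked for more than 120 seconds.",
    "[32908852.259911] Kernel panic - not syncing: hung_task: blocked tasks",
]
hung, panics = summarize_hung_tasks(sample)
print(hung)
print(panics)
```

In practice the input would be the lines of a file holding the dmesg output (for example, the text printed by "dmesg -T" inside the crash session); a hung jbd2 thread whose device is nvme0n1, followed by a "hung_task: blocked tasks" panic, matches the pattern described in this article.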

 


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.