Skip to main content
NetApp Knowledge Base

NFS and iSCSI access failed with alert nodewatchdog.svc.rpc.noresp

Views:
326
Visibility:
Public
Votes:
0
Category:
ontap-9
Specialty:
core
Last Updated:

Applies to

  • ONTAP 9.7
  • Two node switchless cluster

Issue

  • Any client or Host connected over NFS or iSCSI, lost the connection to the NetApp storage
  • Unified Manager alert:

Node: Node-01
Time: Wed, Sep 23 18:24:46 2020 +0200
Severity: ALERT

Message: nodewatchdog.svc.rpc.noresp: The vifmgr service internal to Data ONTAP that is required for continuing data    service was unavailable. The service failed, but was unsuccessfully restarted.

Description: This message occurs when a service critical to data access fails to respond to service monitoring and is   restarted. Data ONTAP(R) might have experienced a serious error and might operate in a degraded mode.

Corrective Action: If the message reports that the service has "restarted" then no action is required. If the status    is "not restarted" or "unsuccessfully restarted," then reboot the node using the "system node reboot" command with the  "-dump true" option, and then contact NetApp technical support.

Source: nodewatchdog
Sequence#: 602147

  • RDB logs contain frequent "apparent starvation" messages like the following:

Example Vifmgr:

Wed Sep 23 2020 18:24:00 +02:00 [kern_vifmgr:info:6763] rdb::qm:Wed Sep 23 18:18:18 2020:src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80b8d7100) QmState::doWorkerWork RDB QM main voter thread slept for 14303 ms (apparent starvation)
 

Example MGWD:

Wed Sep 23 2020 18:24:43 +02:00 [kern_mgwd:info:14611] A [src/rdb/quorum/qm_states/qm_state.cc 471 (0x82ddb7800)]: doWorkerWork: RDB QM main voter thread slept for 3711 ms (apparent starvation).

Wed Sep 23 2020 18:24:43 +02:00 [kern_mgwd:info:14611] A [src/rdb/cluster_events.cc 88 (0x82ddb7800)]: Report: Cluster event: node-event, epoch 36, site 1001 [apparent starvation in QM main voter thread: slept 3711 ms].

Example VLDB:

Wed Sep 23 2020 18:23:38 +02:00 [kern_vldb:info:6843] rdb::qm:Wed Sep 23 18:21:01 2020:src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80ab31900) QmState::doWorkerWork RDB QM main voter thread slept for 4175 ms (apparent starvation)

Wed Sep 23 2020 18:25:15 +02:00 [kern_vldb:info:6843] A [src/rdb/quorum/qm_states/qm_state.cc 471 (0x80ab31900)]: doWorkerWork: RDB QM main voter thread slept for 2498 ms (apparent starvation).

Example Bcomd:

Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] rdb::qm:Wed Sep 23 18:22:25 2020:src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80b757700) QmState::doWorkerWork RDB QM main voter thread slept for 7637 ms (apparent starvation)

Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] A [src/rdb/quorum/qm_states/qm_state.cc 471 (0x80b757700)]: doWorkerWork: RDB QM main voter thread slept for 116272 ms (apparent starvation).

Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] A [src/rdb/cluster_events.cc 88 (0x80b757700)]: Report: Cluster event: node-event, epoch 18, site 1001 [apparent starvation in QM main voter thread: slept 116272 ms].

  • MGWD log also include frequent SMF trace dumps for the table client_device_view:

Fri Sep 23 2020 18:15:47 +02:00 [kern_mgwd:info:60301] client_device_view.next_imp() returns: success (latency: 60.256860s) [-,switch-01(DCOV13007),10.10.10.10,SNMPv2c,true,-,cshm1!,OTHER,management-network,Cisco Nexus Operating System (NX-OS) Software, Version 14.1(2u),Invalid SNMP Settings,CDP/ISDP,-,false,Unknown,yes,no,OTHER] (memory: 434290 net, 445502 max, 5447768 allocated, 5013478 freed, 28040 allocations)

Note: This is the last line of the SMF trace dump. The full trace dump is too large to include here.

  • Elevated CPU utilization could be seen from HostOS Domain. This however may not be the top contributor.

netapp::*> node run -node netapp-01 sysstat -M 1
ANY1+ ANY2+ ANY3+ ANY4+ ANY5+ ANY6+ ANY7+ ANY8+ ANY9+ ANY10+ ANY11+ ANY12+ ANY13+ ANY14+ ANY15+ ANY16+  AVG
 100%  100%  100%   99%   98%   96%   94%   91%   86%    81%    76%    70%    64%    57%    48%   37%   81%

CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
 78%  76%  77%  83%  82%  83%  82%  82%  82%  82%   83%   84%   83%   82%   83%   82%

Nwk_Excl Nwk_Lg Nwk_Exmpt Protocol Cluster Storage Raid Raid_Ex Target Kahuna WAFL_Ex(Kahu)
      3%     2%      450%       0%      0%     49%   2%    136%     0%     4%    511%( 94%)

WAFL_XClean SM_Exempt Cifs Exempt SSAN_Ex Intr Host  Ops/s   CP
         0%        0%   0%   112%      0%  28%  201%  47111   0%

  • The high HostOS utilization is driven out of mgwd & vifmgr processes.

 ::*> systemshell -node <NodeName> top
 ::*> systemshell -node <NodeName> ps auxww | head

PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
6379 root           107  40    0   194M 40272K uwait   1 117.2H  59.47% vifmgr
2169 root           311  67    0   658M   148M uwait   1 174.3H   8.64% mgwd

 

Sign in to view the entire content of this KB article.

New to NetApp?

Learn more about our award-winning Support

Scan to view the article on your device