Backup logs from aborted and/or resumed NDMP operations can cause an ONTAP node's root volume to fill, possibly leading to node panics
Applies to
- ONTAP 9
- Network Data Management Protocol (NDMP) operations, such as ndmpcopy
Issue
- Rapid increase in the used size of a single node's root volume. This can be seen by running the following command periodically:
cluster1::> volume show -vserver cluster1-01
Vserver     Volume       Aggregate    State      Type Size       Available  Used%
----------- ------------ ------------ ---------- ---- ---------- ---------- -----
cluster1-01 vol0         aggr0        online     RW   442.4GB    407.6GB    7%
(Using a node name as the -vserver parameter returns that node's root volume.)
- The backup log located at /mroot/etc/log/backup is filled with messages similar to the following:
Tue Mar 27 00:11:36 EDT 2018 /svm1/vol1 Log_msg (Flush DIRNET for BKP ID=248, type=3 interrupted while waiting for min inflight. Error = Interrupted system call.
The simplest way to access the backup log is through the Service Processor Infrastructure (SPI) interface by clicking the logs link. See KB: How to manually collect logs and copy files from a clustered Data ONTAP storage system (under "Option 1") for assistance working with the SPI.
- The affected node may panic with messages similar to the following:
Example 1:
Process vldb unresponsive for 631 seconds in process nodewatchdog on release 9.2P1 (C)
Note: This panic may be caused by many other issues. This panic alone does not indicate the issue outlined here; make sure to check the node's root volume status as well as the contents of the backup log.
Example 2:
Apr 12 15:49:43 [node-02:callhome.mdb.recovery.unsuccessful:EMERGENCY]: Call home for MDB RECOVERY UNSUCCESSFUL FOR THE coresegd WARNING.
Apr 12 15:51:58 [node-02:callhome.mdb.recovery.unsuccessful:EMERGENCY]: Call home for MDB RECOVERY UNSUCCESSFUL FOR THE mcached WARNING.
Apr 12 15:54:07 [node-02:spm.vifmgr.process.exit:EMERGENCY]: Logical Interface Manager(VifMgr) with ID 9996 aborted as a result of signal normal exit (1). The subsystem will attempt to restart.
Apr 12 15:54:09 [node-02:callhome.mdb.recovery.unsuccessful:EMERGENCY]: Call home for MDB RECOVERY UNSUCCESSFUL FOR THE vifmgr WARNING.
Apr 12 16:03:14 [node-02:callhome.mdb.recovery.unsuccessful:EMERGENCY]: Call home for MDB RECOVERY UNSUCCESSFUL FOR THE bcomd WARNING.
PANIC : Process vifmgr unresponsive for 630 seconds
version: 9.4P3: Thu Oct 11 18:25:55 EDT 2018
conf : x86_64.optimize
cpuid = 3
KDB: stack backtrace:
PANIC: Process vifmgr unresponsive for 630 seconds in process nodewatchdog on release 9.4P3 (C) on Wed Apr 12 16:04:13 KST 2023
Apr 12 16:21:11 [node-02:extCache.rw.replay.canceled:notice]: WAFL external cache replay canceled for aggregate node2_aggr0: Aggregate came online after timeout.
Apr 12 16:22:21 [node-02:mgmtgwd.rootvolrec.low.space:EMERGENCY]: The root volume on node "node-02" is dangerously low on space. Less than 10 MB of free space remaining.
Apr 12 16:22:21 [node-02:callhome.root.vol.recovery.reqd:EMERGENCY]: Call home for ROOT VOLUME NOT WORKING PROPERLY: RECOVERY REQUIRED.
- Backup log growth can fill the root volume and, in some cases, take the root aggregate offline. For example, the backup log files may reach sizes such as:
214G /mroot/etc/log/backup
96G /mroot/etc/log/backup.0
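The two checks described in this section (watching the root volume's Used% and looking for the repeated "interrupted while waiting for min inflight" message in the backup log) can be roughly automated. The following Python sketch is illustrative only: it assumes the `volume show` output has been captured as text in the eight-column layout shown earlier, and that the backup log has already been downloaded (for example through the SPI) to a local file. The function names are hypothetical and are not part of ONTAP or any NetApp tool.

```python
import re

# Signature text taken from the example backup-log message in this article.
SIGNATURE = "interrupted while waiting for min inflight"


def root_volume_used_percent(volume_show_output: str) -> dict:
    """Map each node name to the Used% of its root volume.

    Assumption: data rows have the eight columns shown in this article
    (Vserver, Volume, Aggregate, State, Type, Size, Available, Used%).
    """
    usage = {}
    for line in volume_show_output.splitlines():
        fields = line.split()
        # Skip the prompt, header, and separator lines: only data rows
        # have eight fields and end in a percentage such as "7%".
        if len(fields) == 8 and re.fullmatch(r"\d+%", fields[-1]):
            usage[fields[0]] = int(fields[-1].rstrip("%"))
    return usage


def count_signature_lines(log_lines) -> int:
    """Count backup-log lines that contain the signature message."""
    return sum(1 for line in log_lines if SIGNATURE in line)


if __name__ == "__main__":
    sample_output = (
        "cluster1::> volume show -vserver cluster1-01\n"
        "Vserver Volume Aggregate State Type Size Available Used%\n"
        "--------- ------------ ------------ ---------- ---- ---------- ---------- -----\n"
        "cluster1-01 vol0 aggr0 online RW 442.4GB 407.6GB 7%\n"
    )
    print(root_volume_used_percent(sample_output))  # {'cluster1-01': 7}

    sample_log = [
        "Tue Mar 27 00:11:36 EDT 2018 /svm1/vol1 Log_msg (Flush DIRNET for "
        "BKP ID=248, type=3 interrupted while waiting for min inflight. "
        "Error = Interrupted system call.",
    ]
    print(count_signature_lines(sample_log))  # 1
```

Comparing the Used% value across periodic samples shows whether the root volume is growing rapidly. A high signature count in the downloaded backup log, together with that growth, points toward the issue described here rather than an unrelated panic cause.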