Backup logs from aborted and/or resumed NDMP operations can cause an ONTAP node's root volume to fill, possibly leading to node panics


Category:: ndmp

Specialty:: dp

Applies to

ONTAP 9
Network Data Management Protocol (NDMP) operations, such as ndmpcopy

Issue

Rapid increase in the used size of a single node's root volume. This can be seen by running the following command periodically:

cluster1::> volume show -vserver cluster1-01
Vserver   Volume       Aggregate    State      Type       Size  Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
cluster1-01 vol0       aggr0        online     RW      442.4GB    407.6GB    7%

(Using a node name as the -vserver parameter will return that node's root volume)
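If the command output is captured periodically, the change in Used% between samples can be computed with a small script. The following is a minimal sketch, assuming the output has the column layout shown above; the function names are illustrative and not part of ONTAP.

```python
import re

# One data row of `volume show` output, in the column order shown in this article.
# Header, separator, and prompt lines do not end in "<digits>%" and are skipped.
ROW = re.compile(
    r"^(?P<vserver>\S+)\s+(?P<volume>\S+)\s+(?P<aggregate>\S+)\s+"
    r"(?P<state>\S+)\s+(?P<type>\S+)\s+(?P<size>\S+)\s+"
    r"(?P<available>\S+)\s+(?P<used_pct>\d+)%$"
)

# Sample taken from this article.
SAMPLE = """\
cluster1::> volume show -vserver cluster1-01
Vserver   Volume       Aggregate    State      Type       Size  Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
cluster1-01 vol0       aggr0        online     RW      442.4GB    407.6GB    7%
"""


def parse_used_percent(output):
    """Return {(vserver, volume): used_percent} for every data row found."""
    result = {}
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            key = (match.group("vserver"), match.group("volume"))
            result[key] = int(match.group("used_pct"))
    return result


def used_percent_growth(earlier_output, later_output):
    """Return the Used% change per volume between two captured outputs."""
    earlier = parse_used_percent(earlier_output)
    later = parse_used_percent(later_output)
    return {key: later[key] - earlier[key] for key in later if key in earlier}


if __name__ == "__main__":
    print(parse_used_percent(SAMPLE))
```

A large positive value from used_percent_growth for a node's vol0 over a short interval matches the symptom described here. What counts as "large" depends on the environment and is left to the reader.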

The backup log located at /mroot/etc/log/backup is filled with messages similar to the following:

Tue Mar 27 00:11:36 EDT 2018 /svm1/vol1 Log_msg (Flush DIRNET for BKP ID=248, type=3 interrupted while waiting for min inflight. Error = Interrupted system call.
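Once the backup log has been copied off the node, the number of such messages can be counted per backup ID. This is a minimal sketch that assumes the messages follow the wording of the example above; the pattern is derived from that single example, and the function name is illustrative.

```python
import re
from collections import Counter

# Pattern derived from the single example message in this article (an assumption:
# other ONTAP releases may word the message differently).
INTERRUPTED_FLUSH = re.compile(
    r"Flush DIRNET for BKP ID=(\d+), type=\d+ interrupted"
)


def count_interrupted_flushes(lines):
    """Count interrupted-flush messages per backup ID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = INTERRUPTED_FLUSH.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    example = (
        "Tue Mar 27 00:11:36 EDT 2018 /svm1/vol1 Log_msg (Flush DIRNET for "
        "BKP ID=248, type=3 interrupted while waiting for min inflight. "
        "Error = Interrupted system call."
    )
    print(count_interrupted_flushes([example]))
```

To process a downloaded copy, pass an open file object, which Python iterates line by line, so a very large log does not need to fit in memory.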

The simplest way to access the backup log is through the Service Processor Infrastructure (SPI) interface by clicking the logs link. See KB: How to manually collect logs and copy files from a clustered Data ONTAP storage system (under "Option 1") for assistance working with the SPI.

The affected node may panic with messages similar to the following:

Example 1:

Process vldb unresponsive for 631 seconds in process nodewatchdog on release 9.2P1 (C)

Note: This panic may be caused by many other issues. This panic alone does not indicate the issue outlined here; make sure to check the node's root volume status as well as the contents of the backup log.

Example 2:

Apr 12 15:49:43 [node-02:callhome.mdb.recovery.unsuccessful:EMERGENCY]: Call home for MDB RECOVERY UNSUCCESSFUL FOR THE coresegd WARNING.
Apr 12 15:51:58 [node-02:callhome.mdb.recovery.unsuccessful:EMERGENCY]: Call home for MDB RECOVERY UNSUCCESSFUL FOR THE mcached WARNING.
Apr 12 15:54:07 [node-02:spm.vifmgr.process.exit:EMERGENCY]: Logical Interface Manager(VifMgr) with ID 9996 aborted as a result of signal normal exit (1). The subsystem will attempt to restart.
Apr 12 15:54:09 [node-02:callhome.mdb.recovery.unsuccessful:EMERGENCY]: Call home for MDB RECOVERY UNSUCCESSFUL FOR THE vifmgr WARNING.
Apr 12 16:03:14 [node-02:callhome.mdb.recovery.unsuccessful:EMERGENCY]: Call home for MDB RECOVERY UNSUCCESSFUL FOR THE bcomd WARNING.
PANIC : Process vifmgr unresponsive for 630 seconds
version: 9.4P3: Thu Oct 11 18:25:55 EDT 2018
conf : x86_64.optimize
cpuid = 3
KDB: stack backtrace:
PANIC: Process vifmgr unresponsive for 630 seconds in process nodewatchdog on release 9.4P3 (C) on Wed Apr 12 16:04:13 KST 2023
Apr 12 16:21:11 [node-02:extCache.rw.replay.canceled:notice]: WAFL external cache replay canceled for aggregate node2_aggr0: Aggregate came online after timeout.
Apr 12 16:22:21 [node-02:mgmtgwd.rootvolrec.low.space:EMERGENCY]: The root volume on node "node-02" is dangerously low on space. Less than 10 MB of free space remaining.
Apr 12 16:22:21 [node-02:callhome.root.vol.recovery.reqd:EMERGENCY]: Call home for ROOT VOLUME NOT WORKING PROPERLY: RECOVERY REQUIRED.

Growth of the backup log causes the root volume to run out of space, which in some cases takes the root aggregate offline. For example, the backup log files can reach the following sizes:

214G /mroot/etc/log/backup
96G /mroot/etc/log/backup.0
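To judge how much of the root volume the backup logs consume, the listed sizes can be summed. This is a minimal sketch, assuming the listing uses du-style human-readable sizes with binary suffixes (K, M, G, T); the function name is illustrative.

```python
import re

# Size suffix -> multiplier to GiB, assuming du-style binary units (an assumption).
TO_GIB = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1, "T": 1024}

# A size token such as "214G" followed by an absolute path.
ENTRY = re.compile(r"(?<!\S)(\d+(?:\.\d+)?)([KMGT])\s+(/\S+)")


def sizes_in_gib(listing):
    """Return {path: size_in_GiB} for each '<size><suffix> <path>' entry."""
    return {
        path: float(number) * TO_GIB[suffix]
        for number, suffix, path in ENTRY.findall(listing)
    }


if __name__ == "__main__":
    listing = "214G /mroot/etc/log/backup\n96G /mroot/etc/log/backup.0"
    sizes = sizes_in_gib(listing)
    print(sizes, sum(sizes.values()))
```

For the two files listed here, the combined size is 310 GiB under these assumptions.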