Root aggregate WAFL inconsistent during hardware replacement
Applies to
- ONTAP 9
- AFF A320
Issue
- During maintenance that involves a failover, the node boots to the "Waiting for giveback" prompt. Ctrl+C is pressed there, and the two prompts that follow are answered "no" (do not halt) and then "yes" (continue):
Waiting for giveback...(Press Ctrl-C to abort wait)
This node was previously declared dead.
Pausing to check HA partner status ...
partner is operational and in takeover mode.
You must initiate a giveback or shutdown on the HA
partner in order to bring this node online.
The HA partner is currently operational and in takeover mode. This node cannot continue unless you initiate a giveback on the partner.
Once this is done this node will reboot automatically.
waiting for giveback...
Do you wish to halt this node rather than wait [y/n]? n
The HA partner appears to be either not operational or not in takeover
mode. You will be asked whether you want to continue. If you answer "yes", the
existing failover monitor disk state will be overwritten and this node will be
rebooted. Answering "no" will halt this node with no modification to the failover
monitor disk state.
WARNING: Answering "yes" while the HA partner is operational and in
takeover mode will have unexpected and potentially catastrophic results:
YOUR FILESYSTEMS MAY BE DESTROYED
Do you wish to continue [y/n]? y
Oct 01 12:07:31 [cluster-02:cf.fm.overwriteState:notice]: System continuing after overwriting failover monitor state!
- The taken-over node then reboots and may panic:
Warning: previous shutdown was dirty, there is a possible loss of data.
Oct 01 12:11:04 [cluster-02:wafl.root.content.changed:error]: Contents of the root volume '' might have changed. Verify that all recent configuration changes are still in effect.
PANIC : NVRAM contents are invalid...
- After the panic, the node reboots to the ONTAP login prompt but repeatedly halts:
SP-login: login: HALT: HA partner has taken over (ic) on Sun Oct 1 12:35:34 CDT 2023
- Later, the surviving node panics because of a WAFL metadata inconsistency in the taken-over node's root aggregate:
Sun Oct 01 13:27:50 -0500 [cluster-02: wafl_exempt17: sk.panic:alert]: Panic String: Unrecoverable metadata block (file xxxx, block xxxxxxx, fbn xxxxxxx, level 1, file type 16) in aggregate partner:cluster02_root. WAFL inconsistent. Contact NetApp technical support.
- The taken-over node, which previously halted whenever it was booted, now panics on each boot attempt:
PANIC : Msg execution failed during replay, vol=vol0, msg=0xfffff70067600100, type=WAFL_WRITE, errno=192, replay_idx=1, coalesced=0 coalesced_cnt=63
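The sequence of console messages above can be checked for in a saved console log. The following is a minimal sketch, not a NetApp tool: the patterns are copied from the messages quoted in this article, while the labels, the `SIGNATURES` table, and the `match_signatures` function are hypothetical names chosen for illustration.

```python
import re

# Hypothetical signature table. Each pattern is taken from a console message
# quoted in this article; the labels are illustrative, not NetApp terminology.
SIGNATURES = [
    ("failover monitor state overwritten",
     re.compile(r"cf\.fm\.overwriteState")),
    ("root volume contents may have changed",
     re.compile(r"wafl\.root\.content\.changed")),
    ("NVRAM invalid panic",
     re.compile(r"PANIC\s*:\s*NVRAM contents are invalid")),
    ("halted because partner has taken over",
     re.compile(r"HALT: HA partner has taken over")),
    ("WAFL inconsistent panic",
     re.compile(r"Unrecoverable metadata block.*WAFL inconsistent")),
    ("replay failure panic",
     re.compile(r"PANIC\s*:\s*Msg execution failed during replay")),
]


def match_signatures(console_text):
    """Return the labels of all signatures found, in order of first appearance."""
    found = []
    for line in console_text.splitlines():
        for label, pattern in SIGNATURES:
            if label not in found and pattern.search(line):
                found.append(label)
    return found


if __name__ == "__main__":
    sample = (
        "Oct 01 12:07:31 [cluster-02:cf.fm.overwriteState:notice]: "
        "System continuing after overwriting failover monitor state!\n"
        "PANIC : NVRAM contents are invalid...\n"
    )
    for label in match_signatures(sample):
        print(label)
```

A log that matches several of these signatures in the order listed is consistent with the issue described here, but a match is only a hint and does not replace analysis by NetApp technical support.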