After a power outage, FlexGroup Snapshots are missing although no manual delete was attempted
Applies to
- ONTAP 9 cluster with more than 2 nodes
- FlexGroup volumes in use with constituents spread across both nodes of one HA pair
- FlexGroup Snapshots
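To check whether a given FlexGroup meets the HA-pair condition above, here is a minimal Python sketch. It assumes you have already gathered which node owns each constituent and which nodes are HA partners; the function name `exposed_ha_pairs`, the input format, and the `MyCluster-03`/`MyCluster-04` pair are hypothetical, and treating MyCluster-01 and MyCluster-02 as HA partners is an assumption made only for illustration.

```python
from typing import Dict, List, Set, Tuple


def exposed_ha_pairs(
    constituent_node: Dict[str, str],
    ha_pairs: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the HA pairs in which BOTH nodes own at least one constituent.

    constituent_node: hypothetical mapping of constituent name -> owning node.
    ha_pairs: hypothetical list of (node_a, node_b) HA partners.
    """
    # Set of nodes that own at least one constituent of this FlexGroup.
    owners: Set[str] = set(constituent_node.values())
    # A pair is exposed only if each partner owns a constituent.
    return [pair for pair in ha_pairs if pair[0] in owners and pair[1] in owners]


if __name__ == "__main__":
    ownership = {
        "MyFlexgroup1__0001": "MyCluster-01",
        "MyFlexgroup1__0002": "MyCluster-02",
    }
    # Assumed HA layout; the second pair is hypothetical.
    pairs = [("MyCluster-01", "MyCluster-02"), ("MyCluster-03", "MyCluster-04")]
    print(exposed_ha_pairs(ownership, pairs))
    # prints [('MyCluster-01', 'MyCluster-02')]
```

A pair whose nodes own no constituents, or where only one partner owns constituents, is not returned, because a simultaneous reboot of that pair cannot leave the FlexGroup's constituents out of step with each other in the way described below.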
Issue
- Both nodes of one HA pair:
  - own constituents of the same FlexGroup.
  - reboot unexpectedly at the same time without a clean shutdown (for example, due to a multi-disk panic or a power outage).
  - boot up again at the same time.
- After the reboot, one or more FlexGroup Snapshots are either completely lost or, when only some constituents have lost the corresponding FlexGroup Snapshot, show "-" under Size, Total% and Used% in the output of snapshot show:
::> set adv
Warning: These advanced commands are potentially dangerous; use them only when directed to do so by NetApp personnel.
Do you want to continue? {y|n}: y
::*> vol show -vserver svm1 -volume MyFlexgroup1 -fields is-flexgroup
vserver volume       is-flexgroup
------- ------------ ------------
svm1    MyFlexgroup1 true
::*> volume snapshot show -vserver svm1 -volume MyFlexgroup1
                                                                 ---Blocks---
Vserver  Volume   Snapshot                                  Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
svm1     MyFlexgroup1
                  MySnapshot1                                  -      -     -
                  hourly.2024-03-11_0905                   360KB     0%   36%
2 entries were displayed.
1 entry was acted on.
::*> node run -node MyCluster-01 -command snap status MyFlexgroup1__0001
Node: MyCluster-01
Volume MyFlexgroup1__0001
snapid status date ownblks release fsRev name
------ ------ ------------ ------- ------- ----- --------
2 complete Mar 11 09:05 47 9.7 35092 hourly.2024-03-11_0905
1 complete Mar 11 09:00 47 9.7 35092 MySnapshot1
::*> node run -node MyCluster-02 -command snap status MyFlexgroup1__0002
Node: MyCluster-02
Volume MyFlexgroup1__0002
snapid status date ownblks release fsRev name
------ ------ ------------ ------- ------- ----- -------
2 complete Mar 11 09:05 47 9.7 35092 hourly.2024-03-11_0905
Note: MySnapshot1 is missing on constituent MyFlexgroup1__0002
::*> snapshot show -vserver svm1 -volume MyFlexgroup1 -snapshot MySnapshot1 -fields state
vserver volume       snapshot    state
------- ------------ ----------- -----
svm1    MyFlexgroup1 MySnapshot1 unknown
- An affected node can panic after the reboot because of excessive Snapshot deletion:
Panic_Message: timeout table full in SK process snap_lopri_work on release 9.11.1P8
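The partial-loss case can be detected by comparing the Snapshot names that snap status reports for each constituent, as done by eye in the example above. Below is a minimal Python sketch. It assumes the node run ... snap status text has been saved per constituent and that its rows look like the ones shown here (numeric snapid first, Snapshot name last, no spaces in names). The helper names are hypothetical, and the sketch only reads captured text; it changes nothing on the cluster.

```python
from typing import Dict, Set


def snapshot_names(snap_status_output: str) -> Set[str]:
    """Collect Snapshot names from `snap status` text as shown in this article.

    Assumption: each data row starts with a numeric snapid and ends with the
    Snapshot name, and names contain no spaces.
    """
    names: Set[str] = set()
    for line in snap_status_output.splitlines():
        fields = line.split()
        # Header, dash, "Node:" and "Volume" lines do not start with a digit.
        if fields and fields[0].isdigit():
            names.add(fields[-1])
    return names


def missing_per_constituent(outputs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each constituent to the Snapshots other constituents have but it lacks."""
    per_constituent = {vol: snapshot_names(text) for vol, text in outputs.items()}
    all_names: Set[str] = set().union(*per_constituent.values())
    return {
        vol: all_names - names
        for vol, names in per_constituent.items()
        if all_names - names
    }


if __name__ == "__main__":
    out_0001 = """
    Volume MyFlexgroup1__0001
    snapid status date ownblks release fsRev name
    ------ ------ ------------ ------- ------- ----- --------
    2 complete Mar 11 09:05 47 9.7 35092 hourly.2024-03-11_0905
    1 complete Mar 11 09:00 47 9.7 35092 MySnapshot1
    """
    out_0002 = """
    Volume MyFlexgroup1__0002
    snapid status date ownblks release fsRev name
    ------ ------ ------------ ------- ------- ----- -------
    2 complete Mar 11 09:05 47 9.7 35092 hourly.2024-03-11_0905
    """
    print(missing_per_constituent(
        {"MyFlexgroup1__0001": out_0001, "MyFlexgroup1__0002": out_0002}
    ))
    # prints {'MyFlexgroup1__0002': {'MySnapshot1'}}
```

This comparison only finds Snapshots that survive on at least one constituent (the state unknown case). A FlexGroup Snapshot that was lost on every constituent is reported by none of them, so it cannot be detected this way.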