Azure CVO panics due to WAFL inconsistencies
Applies to
- Cloud Volumes ONTAP (CVO)
- Microsoft Azure
Issue
The root volume of a single-node Azure CVO is in a WAFL-inconsistent state, causing the node to operate on AUTOROOT pending recovery:
Fri Sep 19 07:12:00 +0000 [CVO_System-02: wafl_exempt13: sk.panic:alert]: Panic String: Unrecoverable metadata block (file -1, block 21198419, fbn 23246, level 0, file type 1) in volume vol0. WAFL inconsistent. Contact NetApp technical support. in SK process wafl_exempt13 on release 9.16.1P5 (C)
Cause
- CVO systems deployed with Azure page blob disks are susceptible to I/O throttling from Azure, resulting in “AZURE_SERVER_BUSY” responses
- This throttling can lead to delayed or failed disk operations, causing:
  - Loss of access to critical ONTAP metadata blocks
  - WAFL inconsistencies (especially in root or system volumes)
  - Controller failover issues and node panic events
  - Bad block and checksum verification errors
- Error seen in server logs:
Mon Sep 15 14:27:08 +0000 [jazsacvo100c-01: OscHighPriThreadPool_126: pha.obj.throttle:notice]: PHA Obj throttled 7393 times on the blob 01RootBlob-cvo_system in the container blobcontainer due to the "AZURE_SERVER_BUSY".
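To gauge how often throttling occurs, the throttle messages can be extracted from a saved log file and summed per blob. The following is a minimal Python sketch, not a NetApp tool: the regular expression is inferred from the single sample message above, and the function names parse_throttle_events and total_throttles_by_blob are hypothetical helpers introduced for illustration. Other ONTAP releases may format the message differently.

```python
import re

# Pattern for the pha.obj.throttle message. The field layout is an assumption
# inferred from one sample line; adjust it if your logs differ.
THROTTLE_RE = re.compile(
    r"\[(?P<node>[^:\]]+):.*?pha\.obj\.throttle:\w+\]: "
    r"PHA Obj throttled (?P<count>\d+) times on the blob (?P<blob>.+?)\s*"
    r'in the container (?P<container>\S+) due to the "(?P<reason>[A-Z_]+)"'
)


def parse_throttle_events(lines):
    """Return one dict per pha.obj.throttle line found in `lines`."""
    events = []
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            event = match.groupdict()
            event["count"] = int(event["count"])  # throttle count as a number
            events.append(event)
    return events


def total_throttles_by_blob(events):
    """Sum throttle counts per (node, blob) pair."""
    totals = {}
    for event in events:
        key = (event["node"], event["blob"])
        totals[key] = totals.get(key, 0) + event["count"]
    return totals


if __name__ == "__main__":
    sample = (
        "Mon Sep 15 14:27:08 +0000 [jazsacvo100c-01: OscHighPriThreadPool_126: "
        "pha.obj.throttle:notice]: PHA Obj throttled 7393 times on the blob "
        "01RootBlob-cvo_system in the container blobcontainer due to the "
        '"AZURE_SERVER_BUSY".'
    )
    for key, total in total_throttles_by_blob(parse_throttle_events([sample])).items():
        print(key, total)
```

A steadily rising total for the root blob is a reasonable signal to raise with Azure Support, although the sketch itself does not define what count is significant.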
Solution
- Contact Azure Support to investigate the issue
- Redeploy the CVO to migrate the instance to new hardware in Azure's data center
- If the issue stems from the Azure CVO being deployed on page blobs, mitigate future occurrences as follows:
  - Migrate all data to a new CVO deployment that uses Azure managed disks
    - Managed disk-based CVO systems are not subject to “AZURE_SERVER_BUSY” throttling and provide improved performance and reliability
    - This requires deploying a new CVO instance with managed disks and using SnapMirror or other migration tools to move the data
Additional Information
- Page blob disks have lower IOPS and throughput limits than managed disks
- Throttling is more likely under high load or with large datasets
- Upsizing the VM instance may not resolve the issue, as page blob limits are at the storage account and disk level
- See: CVO architecture: Page Blob vs Managed Disks
