OpenStack: Cinder scheduler uses stale statistics in Active-Active deployments
Applies to
- OpenStack (all releases)
Issue
Cinder scheduler does not make use of a reliable "single source of truth" such as an etcd database. As such:
- Backend statistics, including total_capacity_gb, provisioned_capacity_gb, and max_over_subscription_ratio, all reside in memory on each Cinder scheduler.
- Volume nodes periodically send backend statistics updates to the Cinder scheduler nodes.
- However, a race condition is possible where:
  - Cinder volumes are created within a few seconds of each other
  - Cinder scheduler statistics are not updated before a new volume create request is received
- This allows Cinder volumes to be created on backing storage pools that the Cinder scheduler would have filtered out had its statistics been up to date.
  - This could include scenarios where max_over_subscription_ratio is ignored, allowing a backing storage pool to be over-provisioned beyond the limit set by max_over_subscription_ratio.
  - This could also include a scenario where a nearly full storage pool is selected over a pool with more available space.
This issue impacts both the generic and NetApp Cinder drivers.
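The race described in this article can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not Cinder's actual code: the PoolStats and Scheduler classes, the simplified over-subscription check, and the pool sizes are all hypothetical and chosen for illustration. Two schedulers each hold their own in-memory copy of the same pool statistics, and each accepts a 40 GB request before either receives a refreshed update.

```python
from dataclasses import dataclass, replace


@dataclass
class PoolStats:
    """Simplified backend statistics, using the field names from this article."""
    total_capacity_gb: float
    provisioned_capacity_gb: float
    max_over_subscription_ratio: float


class Scheduler:
    """Hypothetical scheduler holding a private in-memory copy of pool stats."""

    def __init__(self, name: str, stats: PoolStats):
        self.name = name
        self.cached = replace(stats)  # private copy, not shared with peers

    def passes_filter(self, size_gb: float) -> bool:
        # Simplified over-subscription check (an assumption of this sketch).
        ratio = (self.cached.provisioned_capacity_gb + size_gb) / self.cached.total_capacity_gb
        return ratio <= self.cached.max_over_subscription_ratio

    def refresh(self, stats: PoolStats) -> None:
        # Stands in for the periodic statistics update sent by a volume node.
        self.cached = replace(stats)


def simulate():
    # Actual pool state: 100 GB total, ratio 2.0, so the limit is 200 GB provisioned.
    backend = PoolStats(100.0, 150.0, 2.0)
    sched_a = Scheduler("a", backend)
    sched_b = Scheduler("b", backend)

    accepted = []
    # Two 40 GB requests arrive seconds apart, one per scheduler, before any refresh.
    for sched in (sched_a, sched_b):
        if sched.passes_filter(40.0):
            backend.provisioned_capacity_gb += 40.0
            # Assumed for illustration: only the accepting scheduler adjusts its
            # own copy; its peer still sees the old value.
            sched.cached.provisioned_capacity_gb += 40.0
            accepted.append(sched.name)

    limit = backend.total_capacity_gb * backend.max_over_subscription_ratio
    over_provisioned = backend.provisioned_capacity_gb > limit

    # Once a scheduler receives fresh statistics, the same request is rejected.
    sched_b.refresh(backend)
    accepted_after_refresh = sched_b.passes_filter(40.0)
    return accepted, backend.provisioned_capacity_gb, limit, over_provisioned, accepted_after_refresh


if __name__ == "__main__":
    print(simulate())
```

In this sketch both schedulers see 150 GB provisioned when they evaluate their request, so both accept, and the pool ends up at 230 GB against a 200 GB limit. After a statistics refresh the filter behaves as intended and rejects a further request.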