Is my controller overloaded?
Applies to
ONTAP 9
Answer
- To answer this question, use Resource Headroom statistics for a quick measurement, or Performance Capacity in Active IQ Unified Manager for a longer-term view.
- Resource headroom statistics report utilization, operations and latency in the context of headroom guidance for a particular resource. They support:
- Workload placement planning
- Workload balancing
- Visibility of resource performance capacity
- Identification of workloads that are too heavy for a given node
Resource Headroom
- The ONTAP resource headroom object statistics facilitate understanding resource utilization and available headroom for CPU and aggregate resources.
- For CPU resources: resource_headroom_cpu
- For storage aggregate resources: resource_headroom_aggr
- The current_[ops|latency|utilization] and respective optimal_point_* counters provide point-in-time statistics of current utilization vs optimal points
- The optimal_point is the point where an increase in utilization or workload results in a disproportionately higher increase in latency.
- From these counters, physical headroom or performance capacity can be calculated
- Physical headroom is the difference between the current utilization and the optimal point
- If current utilization exceeds the optimal point then the resource is considered "overloaded."
- The confidence factor is used to gauge the accuracy of the optimal point for the given resource.
- Denoted by the following values:
- 1 - Low - Seed value is used for optimal point. There's not enough data to predict optimal point.
- 2 - Medium - Some data to extrapolate optimal point.
- 3 - High - Substantial data which reaches or exceeds optimal point, thereby the "optimal point" is known.
- 0 - Unknown - The resource is not available or is not in use, or there's an internal error such that the data cannot be retrieved.
Example: Viewing headroom statistics for a node where CPU and aggregate resources are exceeded
cluster::> set -privilege advanced
cluster::*> statistics start -object resource_headroom_cpu|resource_headroom_aggr
cluster::*> statistics show -object resource_headroom_cpu -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_cpu
Instance: CPU_node_2
Start-time: 6/17/2020 12:31:57
End-time: 6/17/2020 13:31:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
    current_ops                                                  1506
    optimal_point_ops                                            1264
    current_latency                                              3761
    optimal_point_latency                                        1446
    current_utilization                                            82
    optimal_point_utilization                                      57
    optimal_point_confidence_factor                                 3

cluster::*> statistics show -object resource_headroom_aggr -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_aggr
Instance: DISK_HDD_node_2_aggr1_fb7a0d4f-9d65-4211-b651-b4cd422ee11d
Start-time: 6/17/2020 12:37:57
End-time: 6/17/2020 13:37:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
    current_ops                                                  1488
    optimal_point_ops                                            1156
    current_latency                                             38924
    optimal_point_latency                                       28913
    current_utilization                                            67
    optimal_point_utilization                                      52
    optimal_point_confidence_factor                                 3
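The headroom check described above can be sketched in Python. This is a minimal sketch, not an ONTAP tool: the function names parse_counters and assess are hypothetical, the counter rows are copied from the example output, and treating headroom as optimal point minus current utilization (so a negative value means the resource is overloaded) is an assumed sign convention.

```python
import re

# Counter rows copied from the "statistics show" example output.
CPU_OUTPUT = """\
current_ops                      1506
optimal_point_ops                1264
current_latency                  3761
optimal_point_latency            1446
current_utilization              82
optimal_point_utilization        57
optimal_point_confidence_factor  3
"""

AGGR_OUTPUT = """\
current_ops                      1488
optimal_point_ops                1156
current_latency                  38924
optimal_point_latency            28913
current_utilization              67
optimal_point_utilization        52
optimal_point_confidence_factor  3
"""

# Confidence factor values as listed in this article.
CONFIDENCE = {0: "Unknown", 1: "Low", 2: "Medium", 3: "High"}


def parse_counters(text):
    """Hypothetical helper: turn 'name value' rows into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        match = re.fullmatch(r"\s*([a-z_]+)\s+(\d+)\s*", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def assess(counters):
    """Hypothetical helper: compare current utilization with the optimal point."""
    current = counters["current_utilization"]
    optimal = counters["optimal_point_utilization"]
    factor = counters["optimal_point_confidence_factor"]
    return {
        # Assumed convention: positive means spare headroom, negative means overloaded.
        "headroom": optimal - current,
        "overloaded": current > optimal,
        "confidence": CONFIDENCE.get(factor, "Unknown"),
    }


if __name__ == "__main__":
    for name, text in (("CPU", CPU_OUTPUT), ("aggregate", AGGR_OUTPUT)):
        print(name, assess(parse_counters(text)))
```

With the example values, both resources come out as overloaded with a High confidence factor, which matches the scenario named in the example title.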
- Resource statistics over longer time frames are available from the Active IQ performance dashboards and are more useful for capacity planning.
- The peak_performance metric in the graphs represents the optimal_point_utilization counter from the resource_headroom statistics.
- Further details of Active IQ performance graphs
Workload Utilization
- How much of a given resource each workload consumes can be determined with the workload or qos statistics
- QoS statistics provide point-in-time statistics of the resource utilization of workloads, on a per-node basis
Example: volume vol4test is a heavy consumer of both CPU and aggregate resources.
cluster::> qos statistics volume resource cpu show -node node_1
Workload           ID   CPU
--------------- ----- -----
-total- (400%)      -   69%
vol4test-wid2.. 23350   69%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   70%
vol4test-wid2.. 23350   70%

cluster::> qos statistics volume resource disk show -node node_1
Workload            ID   Disk Number of HDD Disks   Disk Number of SSD Disks
--------------- ------ ------ ------------------- ------ -------------------
-total-              -    32%                  26     0%                   0
vol4test-wid2..  23350    92%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    96%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    97%                   9     0%                   0
-total-              -    31%                  26     0%                   0
vol4test-wid2..  23350    91%                   9     0%                   0
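Because the qos statistics commands print a new sample on every iteration, it can help to average the samples per workload before comparing consumers. The following Python sketch does that for the CPU output above. The function name average_cpu_by_workload is hypothetical, the input is pasted text rather than a live session, and the parsing assumes the three-column layout shown in the example.

```python
from collections import defaultdict

# Sample rows copied from the "qos statistics volume resource cpu show" example.
QOS_CPU_OUTPUT = """\
Workload           ID   CPU
--------------- ----- -----
-total- (400%)      -   69%
vol4test-wid2.. 23350   69%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   70%
vol4test-wid2.. 23350   70%
"""


def average_cpu_by_workload(text):
    """Hypothetical helper: mean CPU percentage per workload across samples."""
    samples = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Data rows end with a percentage; header and separator rows do not.
        if len(fields) < 3 or not fields[-1].endswith("%"):
            continue
        # Skip the per-node summary rows.
        if fields[0] == "-total-":
            continue
        samples[fields[0]].append(float(fields[-1].rstrip("%")))
    return {name: sum(values) / len(values) for name, values in samples.items()}


if __name__ == "__main__":
    averages = average_cpu_by_workload(QOS_CPU_OUTPUT)
    # Print the heaviest consumers first.
    for name, cpu in sorted(averages.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: {cpu:.2f}% CPU")
```

The workload name is kept exactly as the CLI prints it, including the truncation marker, since the full name is not recoverable from the output.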
Additional Information
- The node shell wafltop command can also help identify which volumes or workloads are the biggest consumers of various resources
- What is Performance Capacity