Is my controller overloaded?

Last updated

May 27, 2025

Category:: ontap-9

Specialty:: perf

Applies to

ONTAP 9

Answer

To answer this question, use Resource Headroom statistics for a quick measurement, or Performance Capacity in Active IQ Unified Manager.
Resource headroom statistics report utilization, operations, and latency in the context of headroom guidance for a particular resource. They help with:
- Workload placement planning
- Workload balancing
- Visibility of resource performance capacity
- Identifying workloads that are too heavy for a given node

Resource Headroom

The ONTAP resource headroom statistics objects show resource utilization and available headroom for CPU and aggregate resources.
- For CPU resources: resource_headroom_cpu.
- For storage aggregate resources: resource_headroom_aggr.

The current_[ops|latency|utilization] counters and their respective optimal_point_* counters provide point-in-time statistics of current utilization versus the optimal point.
- The optimal point is the point beyond which an increase in utilization or workload results in a disproportionately higher increase in latency.
- From these counters, physical headroom or performance capacity can be calculated.
  - Physical headroom is the difference between the current utilization and the optimal point.
  - If current utilization exceeds the optimal point, the resource is considered "overloaded."
- The confidence factor is used to gauge the accuracy of the optimal point for the given resource.
  - Denoted by the following values:
    - 1 - Low - Seed value is used for optimal point. There's not enough data to predict optimal point.
    - 2 - Medium - Some data to extrapolate optimal point.
    - 3 - High - Substantial data which reaches or exceeds optimal point, thereby the "optimal point" is known.
    - 0 - Unknown - The resource is not available or is not in use, or there's an internal error such that the data cannot be retrieved.
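
The headroom calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not an ONTAP tool: the function names, the sign convention (optimal point minus current utilization, so a negative result means overloaded), and the input values in the usage section are assumptions made for the example.

```python
# Confidence factor values as listed in this article.
CONFIDENCE_LABELS = {0: "Unknown", 1: "Low", 2: "Medium", 3: "High"}


def physical_headroom(current_utilization, optimal_point_utilization):
    """Return headroom in percentage points.

    Assumed convention: optimal point minus current utilization,
    so a negative value means the optimal point has been exceeded.
    """
    return optimal_point_utilization - current_utilization


def is_overloaded(current_utilization, optimal_point_utilization):
    """A resource is considered overloaded when current utilization
    exceeds the optimal point."""
    return current_utilization > optimal_point_utilization


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    current, optimal, confidence = 40, 60, 2
    print("headroom:", physical_headroom(current, optimal))
    print("overloaded:", is_overloaded(current, optimal))
    print("confidence:", CONFIDENCE_LABELS.get(confidence, "Unknown"))
```

A result computed with a confidence factor of 1 (Low) relies on a seed value for the optimal point, so it is probably best treated as a rough estimate.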

Example: Viewing headroom statistics for a node where both the CPU and aggregate optimal points are exceeded

cluster::> set -privilege advanced
cluster::*> statistics start -object resource_headroom_cpu|resource_headroom_aggr
cluster::*> statistics show -object resource_headroom_cpu -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_cpu
Instance: CPU_node_2
Start-time: 6/17/2020 12:31:57
End-time: 6/17/2020 13:31:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1506
                   optimal_point_ops                             1264
                     current_latency                             3761
               optimal_point_latency                             1446
                 current_utilization                               82
           optimal_point_utilization                               57
     optimal_point_confidence_factor                                3

        
cluster::*> statistics show -object resource_headroom_aggr -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_aggr
Instance: DISK_HDD_node_2_aggr1_fb7a0d4f-9d65-4211-b651-b4cd422ee11d
Start-time: 6/17/2020 12:37:57
End-time: 6/17/2020 13:37:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1488
                   optimal_point_ops                             1156
                     current_latency                            38924
               optimal_point_latency                            28913
                 current_utilization                               67  
           optimal_point_utilization                               52  
     optimal_point_confidence_factor                                3
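
As a rough sketch of how output like this might be evaluated offline, the Python below parses a captured counter table and compares the current values with the optimal point. The parsing approach, the function names, and the headroom sign convention (optimal point minus current value) are assumptions for illustration; the sample text is the CPU counter table from the example above.

```python
import re

# Counter table copied from the resource_headroom_cpu example.
SAMPLE_OUTPUT = """
    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1506
                   optimal_point_ops                             1264
                     current_latency                             3761
               optimal_point_latency                             1446
                 current_utilization                               82
           optimal_point_utilization                               57
     optimal_point_confidence_factor                                3
"""

# A data row is a lowercase counter name followed by an integer value.
COUNTER_LINE = re.compile(r"^\s*([a-z_]+)\s+(\d+)\s*$")


def parse_counters(text):
    """Extract 'name value' rows from the counter table into a dict."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def summarize(counters):
    """Compare current values with the optimal point.

    Assumed convention: headroom = optimal point - current value,
    so a negative number means the optimal point is exceeded.
    """
    return {
        "utilization_headroom": counters["optimal_point_utilization"]
        - counters["current_utilization"],
        "ops_headroom": counters["optimal_point_ops"] - counters["current_ops"],
        "overloaded": counters["current_utilization"]
        > counters["optimal_point_utilization"],
    }


if __name__ == "__main__":
    print(summarize(parse_counters(SAMPLE_OUTPUT)))
```

For the CPU example, current utilization (82) exceeds the optimal point (57) by 25 points, so the sketch flags the resource as overloaded. The same comparison applies to the aggregate counters (67 versus 52).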

Statistics over longer time frames are available from the Active IQ performance dashboards, which are more useful for capacity planning.
The peak_performance metric in the graphs represents the optimal_point_utilization counter from the resource_headroom statistics.
- Further details of Active IQ performance graphs


Workload Utilization

How much of a given resource each workload consumes can be determined from the workload or QoS statistics.
- QoS statistics provide point-in-time statistics of workload resource utilization, on a per-node basis.

Example: volume vol4test is a heavy consumer of both CPU and aggregate resources.

cluster::> qos statistics volume resource cpu show -node node_1
Workload           ID   CPU 
--------------- ----- ----- 
-total- (400%)      -   69% 
vol4test-wid2.. 23350   69% 
-total- (400%)      -   73% 
vol4test-wid2.. 23350   73% 
-total- (400%)      -   73% 
vol4test-wid2.. 23350   73% 
-total- (400%)      -   70% 
vol4test-wid2.. 23350   70% 

cluster::> qos statistics volume resource disk show -node node_1
Workload            ID   Disk Number of HDD Disks   Disk Number of SSD Disks
--------------- ------ ------ ------------------- ------ -------------------
-total-              -    32%                  26     0%                   0
vol4test-wid2..  23350    92%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    96%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    97%                   9     0%                   0
-total-              -    31%                  26     0%                   0
vol4test-wid2..  23350    91%                   9     0%                   0
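
Because the qos statistics commands print one sample per interval, it can help to average the samples per workload before comparing consumers. The Python below is a minimal sketch that does this for captured CPU output; the regular expression and function name are assumptions for illustration, and the sample text is the CPU output from the example above.

```python
import re
from collections import defaultdict
from statistics import mean

# Output copied from the "qos statistics volume resource cpu show" example.
SAMPLE_OUTPUT = """
Workload           ID   CPU
--------------- ----- -----
-total- (400%)      -   69%
vol4test-wid2.. 23350   69%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   70%
vol4test-wid2.. 23350   70%
"""

# Workload name, optional "(NNN%)" suffix on the total row, ID, CPU percent.
ROW = re.compile(r"^(\S+)(?:\s+\(\d+%\))?\s+(\S+)\s+(\d+)%\s*$")


def average_cpu(text):
    """Return the mean CPU percentage per workload across all samples."""
    samples = defaultdict(list)
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            workload, _workload_id, cpu = match.groups()
            samples[workload].append(int(cpu))
    return {workload: mean(values) for workload, values in samples.items()}


if __name__ == "__main__":
    for workload, cpu in average_cpu(SAMPLE_OUTPUT).items():
        print(f"{workload}: {cpu:.2f}%")
```

For the four samples shown, vol4test averages 71.25% CPU, which matches the total on the node and suggests that this one workload accounts for the CPU consumption reported there.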

Additional Information

The node shell wafltop command can also help identify which volumes or workloads are the biggest consumers of various resources.
What is Performance Capacity