NetApp Knowledge Base

Is my controller overloaded?

Category:
ontap-9
Specialty:
perf

Applies to

  • ONTAP 9

Answer

The answer to this question can be determined from Resource Headroom statistics.

  • Resource headroom statistics report utilization, operations, and latency in the context of headroom guidance for a particular resource. They help with:
    • Workload placement planning
    • Workload balancing
    • Visibility of resource performance capacity
    • Identifying workloads that are too heavy for a given node
Resource Headroom

1. The ONTAP resource headroom statistics objects help you understand resource utilization and available headroom for CPU and aggregate resources.

  • For CPU resources: resource_headroom_cpu.
  • For storage aggregate resources: resource_headroom_aggr.

 

2. The current_[ops|latency|utilization] counters and their respective optimal_point_* counters provide point-in-time statistics of current usage versus the optimal point.

  • The optimal_point is the point beyond which an increase in utilization or workload results in a disproportionately higher increase in latency.
  • From these counters, physical headroom (performance capacity) can be calculated.
    • Physical headroom is the difference between the optimal point and the current utilization.
    • If current utilization exceeds the optimal point, the resource is considered "overloaded."
  • The confidence factor is used to gauge the accuracy of the optimal point for the given resource.
    • It is denoted by the following values:
      • 1 - Low - A seed value is used for the optimal point. There is not enough data to predict the optimal point.
      • 2 - Medium - There is some data from which to extrapolate the optimal point.
      • 3 - High - There is substantial data that reaches or exceeds the optimal point, so the optimal point is known.
      • 0 - Unknown - The resource is not available or is not in use, or there's an internal error such that the data cannot be retrieved. 
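
The headroom arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not part of ONTAP: the class and function names are hypothetical, the input values in the usage line are made up, and headroom is assumed to be the optimal point minus the current value, so a negative result means the resource is overloaded.

```python
from dataclasses import dataclass

# Confidence factor values as listed in this article.
CONFIDENCE_LABELS = {0: "Unknown", 1: "Low", 2: "Medium", 3: "High"}


@dataclass
class HeadroomSample:
    """Hypothetical container for three resource_headroom counters."""
    current_utilization: float
    optimal_point_utilization: float
    confidence_factor: int


def physical_headroom(sample: HeadroomSample) -> float:
    # Assumption: headroom = optimal point - current utilization.
    return sample.optimal_point_utilization - sample.current_utilization


def is_overloaded(sample: HeadroomSample) -> bool:
    # Overloaded when current utilization exceeds the optimal point.
    return sample.current_utilization > sample.optimal_point_utilization


def describe(sample: HeadroomSample) -> str:
    label = CONFIDENCE_LABELS.get(sample.confidence_factor, "Unknown")
    state = "overloaded" if is_overloaded(sample) else "within headroom"
    return f"{state}, headroom {physical_headroom(sample):+g}, confidence {label}"


# Illustrative values only (not taken from a real system).
print(describe(HeadroomSample(40, 60, 2)))
```

A low confidence factor suggests treating the computed headroom as a rough estimate, because the optimal point is then based on a seed value rather than observed data.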

 

3. See the following example for details on how to view resource_headroom statistics.

  • In the example below, both the CPU and the aggregate exceed their optimal points.
  • When utilization exceeds the optimal_point, take steps to reduce or balance the workload.
cluster::> set -privilege advanced
cluster::*> statistics start -object resource_headroom_cpu|resource_headroom_aggr
cluster::*> statistics show -object resource_headroom_cpu -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_cpu
Instance: CPU_node_2
Start-time: 6/17/2020 12:31:57
End-time: 6/17/2020 13:31:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1506
                   optimal_point_ops                             1264
                     current_latency                             3761
               optimal_point_latency                             1446
                 current_utilization                               82
           optimal_point_utilization                               57
     optimal_point_confidence_factor                                3

        
cluster::*> statistics show -object resource_headroom_aggr -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_aggr
Instance: DISK_HDD_node_2_aggr1_fb7a0d4f-9d65-4211-b651-b4cd422ee11d
Start-time: 6/17/2020 12:37:57
End-time: 6/17/2020 13:37:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1488
                   optimal_point_ops                             1156
                     current_latency                            38924
               optimal_point_latency                            28913
                 current_utilization                               67  
           optimal_point_utilization                               52  
     optimal_point_confidence_factor                                3
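
To make the comparison explicit, the following Python sketch parses captured `statistics show` text and subtracts each current value from its optimal point. It is a hedged illustration: the function names are hypothetical, the parser assumes the two-column counter layout shown above, and the sample text is the CPU output from this example.

```python
import re

# CPU counters copied from the example output above.
SAMPLE = """\
Object: resource_headroom_cpu
Instance: CPU_node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1506
                   optimal_point_ops                             1264
                     current_latency                             3761
               optimal_point_latency                             1446
                 current_utilization                               82
           optimal_point_utilization                               57
     optimal_point_confidence_factor                                3
"""

# An indented lowercase counter name followed by an integer value.
COUNTER_ROW = re.compile(r"^\s+([a-z_]+)\s+(\d+)\s*$")


def parse_counters(text: str) -> dict:
    """Return {counter_name: value} for every counter row in the text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_ROW.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def headroom(counters: dict) -> dict:
    """Optimal point minus current value; negative means exceeded."""
    return {
        metric: counters[f"optimal_point_{metric}"] - counters[f"current_{metric}"]
        for metric in ("ops", "latency", "utilization")
    }


counters = parse_counters(SAMPLE)
for metric, value in headroom(counters).items():
    status = "exceeded" if value < 0 else "ok"
    print(f"{metric}: {value} ({status})")
```

All three differences are negative for this CPU sample, which matches the conclusion that the node has passed its optimal point.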

4. Resource statistics covering longer time frames are available from the Active IQ performance dashboards, which are more useful for capacity planning.


Workload Utilization
  • How much of a given resource each workload consumes can be determined from the workload or qos statistics.
    • The qos statistics provide point-in-time statistics of workload resource utilization on a per-node basis.
    • In the example below, volume vol4test is a heavy consumer of both CPU and aggregate HDD resources.
cluster::> qos statistics volume resource cpu show -node node_1
Workload           ID   CPU 
--------------- ----- ----- 
-total- (400%)      -   69% 
vol4test-wid2.. 23350   69% 
-total- (400%)      -   73% 
vol4test-wid2.. 23350   73% 
-total- (400%)      -   73% 
vol4test-wid2.. 23350   73% 
-total- (400%)      -   70% 
vol4test-wid2.. 23350   70% 

cluster::> qos statistics volume resource disk show -node node_1
Workload            ID   Disk Number of HDD Disks   Disk Number of SSD Disks
--------------- ------ ------ ------------------- ------ -------------------
-total-              -    32%                  26     0%                   0
vol4test-wid2..  23350    92%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    96%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    97%                   9     0%                   0
-total-              -    31%                  26     0%                   0
vol4test-wid2..  23350    91%                   9     0%                   0
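
Because `qos statistics` prints a new sample on each iteration, it can help to average the samples per workload. The Python sketch below does this for the CPU output shown above. It is an illustration under assumptions: the function name is hypothetical, and the regular expression only targets the three-column CPU layout (workload, ID, CPU percentage), skipping the `-total-` rows.

```python
import re
from collections import defaultdict

# CPU samples copied from the example output above.
SAMPLE = """\
Workload           ID   CPU
--------------- ----- -----
-total- (400%)      -   69%
vol4test-wid2.. 23350   69%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   70%
vol4test-wid2.. 23350   70%
"""

# Workload name, numeric workload ID, then a percentage.
# The -total- rows do not match because their ID column is "-".
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)%")


def average_cpu_by_workload(text: str) -> dict:
    """Return {workload_name: mean CPU percent} across all samples."""
    samples = defaultdict(list)
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            samples[match.group(1)].append(int(match.group(3)))
    return {name: sum(vals) / len(vals) for name, vals in samples.items()}


for name, avg in average_cpu_by_workload(SAMPLE).items():
    print(f"{name}: {avg:.2f}% CPU on average")
```

The disk output has additional columns, so it would need its own pattern; the same averaging approach applies.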

 

Additional Information

  • The node shell wafltop command can also help identify which volumes or workloads are the biggest consumers of various resources.
  • What is Performance Capacity