NetApp Knowledge Base

Is my controller overloaded?

Category: ontap-9
Specialty: perf

Applies to

ONTAP 9

Answer

  • To answer this question, use Resource Headroom statistics for a quick, point-in-time measurement, or Performance Capacity in Active IQ Unified Manager.
  • Resource headroom statistics report utilization, operations, and latency relative to the headroom guidance for a particular resource. They help with:
    • Workload placement planning
    • Workload balancing
    • Visibility of resource performance capacity
    • Identifying workloads that are too heavy for a given node
Resource Headroom
  • The ONTAP resource headroom statistics objects show resource utilization and available headroom for CPU and aggregate resources.
    • For CPU resources: resource_headroom_cpu.
    • For storage aggregate resources: resource_headroom_aggr.

 

  • The current_[ops|latency|utilization] counters and their respective optimal_point_* counters provide point-in-time statistics of current utilization versus the optimal point.
    • The optimal point is the point beyond which an increase in utilization or workload results in a disproportionately higher increase in latency.
    • From these counters, physical headroom or performance capacity can be calculated:
      • Physical headroom is the difference between the current utilization and the optimal point.
      • If current utilization exceeds the optimal point, the resource is considered "overloaded."
    • The confidence factor is used to gauge the accuracy of the optimal point for the given resource.
      • It takes one of the following values:
        • 1 - Low - A seed value is used for the optimal point. There is not enough data to predict the optimal point.
        • 2 - Medium - There is some data from which to extrapolate the optimal point.
        • 3 - High - There is substantial data that reaches or exceeds the optimal point, so the optimal point is known.
        • 0 - Unknown - The resource is not available or is not in use, or an internal error prevents the data from being retrieved.
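
The headroom calculation described above can be sketched in Python. This is a minimal illustration, not NetApp tooling: the HeadroomSample class and its field names are hypothetical (chosen to mirror the counter names), and headroom is assumed to be computed as the optimal point minus the current utilization, so a negative value means the resource is overloaded. The sample values are the CPU figures from the example in this article.

```python
from dataclasses import dataclass

# Confidence factor values as listed in this article.
CONFIDENCE_LABELS = {0: "Unknown", 1: "Low", 2: "Medium", 3: "High"}


@dataclass
class HeadroomSample:
    """Hypothetical container for three resource_headroom_* counter values."""
    current_utilization: int
    optimal_point_utilization: int
    optimal_point_confidence_factor: int

    @property
    def headroom(self) -> int:
        # Assumed sign convention: positive means capacity remains,
        # negative means the optimal point has been exceeded.
        return self.optimal_point_utilization - self.current_utilization

    @property
    def overloaded(self) -> bool:
        return self.current_utilization > self.optimal_point_utilization

    @property
    def confidence(self) -> str:
        return CONFIDENCE_LABELS.get(self.optimal_point_confidence_factor, "Unknown")


# CPU values from the example in this article.
cpu = HeadroomSample(current_utilization=82,
                     optimal_point_utilization=57,
                     optimal_point_confidence_factor=3)
print(f"headroom={cpu.headroom} overloaded={cpu.overloaded} confidence={cpu.confidence}")
```

When the confidence factor is 1 (Low), the optimal point is only a seed value, so an "overloaded" result from a calculation like this should be treated with caution.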

 

Example: Viewing headroom statistics for a node where the CPU and aggregate optimal points are exceeded

cluster::> set -privilege advanced
cluster::*> statistics start -object resource_headroom_cpu|resource_headroom_aggr
cluster::*> statistics show -object resource_headroom_cpu -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_cpu
Instance: CPU_node_2
Start-time: 6/17/2020 12:31:57
End-time: 6/17/2020 13:31:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1506
                   optimal_point_ops                             1264
                     current_latency                             3761
               optimal_point_latency                             1446
                 current_utilization                               82
           optimal_point_utilization                               57
     optimal_point_confidence_factor                                3

        
cluster::*> statistics show -object resource_headroom_aggr -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_aggr
Instance: DISK_HDD_node_2_aggr1_fb7a0d4f-9d65-4211-b651-b4cd422ee11d
Start-time: 6/17/2020 12:37:57
End-time: 6/17/2020 13:37:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1488
                   optimal_point_ops                             1156
                     current_latency                            38924
               optimal_point_latency                            28913
                 current_utilization                               67  
           optimal_point_utilization                               52  
     optimal_point_confidence_factor                                3
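
Once the CLI output is captured as text, the comparison of current and optimal-point counters can be automated. The sketch below assumes the two-column Counter/Value layout shown above; parse_counters and exceeded_metrics are hypothetical helper names, not part of ONTAP.

```python
import re

# Matches counter rows such as "   current_ops      1488" (assumed layout).
# Header, separator, and "Object:"-style lines do not match.
ROW = re.compile(r"^\s*([a-z_]+)\s+(\d+)\s*$")


def parse_counters(text):
    """Return {counter_name: int_value} for every counter row in the text."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def exceeded_metrics(counters):
    """Return the metrics whose current value is above the optimal point."""
    exceeded = []
    for metric in ("ops", "latency", "utilization"):
        current = counters.get(f"current_{metric}")
        optimal = counters.get(f"optimal_point_{metric}")
        if current is not None and optimal is not None and current > optimal:
            exceeded.append(metric)
    return exceeded


# Aggregate output copied from the example above.
SAMPLE = """
    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1488
                   optimal_point_ops                             1156
                     current_latency                            38924
               optimal_point_latency                            28913
                 current_utilization                               67
           optimal_point_utilization                               52
     optimal_point_confidence_factor                                3
"""

aggr = parse_counters(SAMPLE)
print(exceeded_metrics(aggr))
```

For the sample aggregate, all three metrics (ops, latency, and utilization) are above their optimal points, which matches the "overloaded" condition described earlier.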
  • Resource statistics over longer time frames are available from the Active IQ performance dashboards and are more useful for capacity planning.
  • The peak_performance metric in the graphs represents the optimal_point_utilization counter from the resource_headroom statistics.

Workload Utilization
  • How much of a given resource each workload consumes can be determined from the workload or QoS statistics.
    • QoS statistics provide point-in-time statistics of workload resource utilization on a per-node basis.

Example: volume vol4test is a heavy consumer of both CPU and aggregate resources.

cluster::> qos statistics volume resource cpu show -node node_1
Workload           ID   CPU 
--------------- ----- ----- 
-total- (400%)      -   69% 
vol4test-wid2.. 23350   69% 
-total- (400%)      -   73% 
vol4test-wid2.. 23350   73% 
-total- (400%)      -   73% 
vol4test-wid2.. 23350   73% 
-total- (400%)      -   70% 
vol4test-wid2.. 23350   70% 

cluster::> qos statistics volume resource disk show -node node_1
Workload            ID   Disk Number of HDD Disks   Disk Number of SSD Disks
--------------- ------ ------ ------------------- ------ -------------------
-total-              -    32%                  26     0%                   0
vol4test-wid2..  23350    92%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    96%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    97%                   9     0%                   0
-total-              -    31%                  26     0%                   0
vol4test-wid2..  23350    91%                   9     0%                   0
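
Because the qos statistics commands print one sample per interval, averaging the samples per workload can give a steadier picture of the heaviest consumers. The following sketch assumes the row layout shown in the examples above (workload name, numeric ID, then a percentage); average_by_workload is a hypothetical helper, not an ONTAP command.

```python
import re
from collections import defaultdict

# Workload rows: name, numeric workload ID, first percentage column.
# The "-total-" rows have "-" as their ID, so they do not match.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)%")


def average_by_workload(text):
    """Average the first percentage column per workload across all samples."""
    samples = defaultdict(list)
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            samples[match.group(1)].append(int(match.group(3)))
    return {name: sum(vals) / len(vals) for name, vals in samples.items()}


# Output copied from the CPU example above.
CPU_OUTPUT = """
Workload           ID   CPU
--------------- ----- -----
-total- (400%)      -   69%
vol4test-wid2.. 23350   69%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   70%
vol4test-wid2.. 23350   70%
"""

# Output copied from the disk example above (first percentage is HDD).
DISK_OUTPUT = """
Workload            ID   Disk Number of HDD Disks   Disk Number of SSD Disks
--------------- ------ ------ ------------------- ------ -------------------
-total-              -    32%                  26     0%                   0
vol4test-wid2..  23350    92%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    96%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    97%                   9     0%                   0
-total-              -    31%                  26     0%                   0
vol4test-wid2..  23350    91%                   9     0%                   0
"""

print(average_by_workload(CPU_OUTPUT))
print(average_by_workload(DISK_OUTPUT))
```

With these samples, vol4test averages 71.25% CPU and 94% HDD utilization, which is consistent with it being the dominant consumer on the node.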

 

Additional Information

  • The node shell wafltop command can also help identify which volumes or workloads are the biggest consumers of various resources.
  • What is Performance Capacity

 
