NetApp Knowledge Base

Is my controller overloaded?

Applies to

ONTAP 9

Answer

  • For a quick measurement, check the ONTAP resource headroom statistics; for a longer-term view, use Performance Capacity in Active IQ Unified Manager.
  • Resource headroom statistics report utilization, operations, and latency against the headroom guidance for a particular resource. They help with:
    • Workload placement planning
    • Workload balancing
    • Visibility of resource performance capacity
    • Identifying workloads that are too heavy for a given node
Resource Headroom
  • The ONTAP resource headroom statistics objects show resource utilization and available headroom for CPU and aggregate resources:
    • For CPU resources: resource_headroom_cpu.
    • For storage aggregate resources: resource_headroom_aggr.

 

  • The current_[ops|latency|utilization] counters and their respective optimal_point_* counters provide point-in-time statistics of the current load versus the optimal point.
    • The optimal point is the point at which an increase in utilization or workload results in a disproportionately higher increase in latency.
    • From these counters, physical headroom or performance capacity can be calculated:
      • Physical headroom is the difference between the current utilization and the optimal point.
      • If the current utilization exceeds the optimal point, the resource is considered "overloaded."
    • The confidence factor is used to gauge the accuracy of the optimal point for the given resource.
      • It is denoted by the following values:
        • 1 - Low - A seed value is used for the optimal point. There is not enough data to predict the optimal point.
        • 2 - Medium - There is some data from which to extrapolate the optimal point.
        • 3 - High - There is substantial data that reaches or exceeds the optimal point, so the optimal point is known.
        • 0 - Unknown - The resource is not available or not in use, or an internal error prevents the data from being retrieved.
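
The headroom calculation described above can be sketched in a few lines of Python. This is only an illustration, not part of ONTAP: the HeadroomSample class and both function names are hypothetical, the sample values are illustrative, and the sign convention (optimal point minus current utilization, so a negative result means the optimal point is exceeded) is an assumption of this sketch.

```python
from dataclasses import dataclass

# Labels for the confidence factor values listed above.
CONFIDENCE_LABELS = {0: "unknown", 1: "low", 2: "medium", 3: "high"}


@dataclass
class HeadroomSample:
    """One point-in-time reading of headroom counters (hypothetical container)."""
    current_utilization: float
    optimal_point_utilization: float
    optimal_point_confidence_factor: int


def physical_headroom(sample: HeadroomSample) -> float:
    """Return optimal point minus current utilization.

    Assumed sign convention: a negative value means the optimal point is exceeded.
    """
    return sample.optimal_point_utilization - sample.current_utilization


def assess(sample: HeadroomSample) -> str:
    """Summarize a sample as a short, human-readable verdict."""
    confidence = CONFIDENCE_LABELS.get(sample.optimal_point_confidence_factor, "unknown")
    if confidence == "unknown":
        # Confidence 0: the optimal point cannot be trusted, so give no verdict.
        return "no verdict (confidence unknown)"
    headroom = physical_headroom(sample)
    state = "overloaded" if headroom < 0 else "within optimal point"
    return f"{state}, headroom {headroom:+.0f} points, confidence {confidence}"


if __name__ == "__main__":
    # Illustrative values: 82% current utilization against a 57% optimal point.
    cpu = HeadroomSample(82, 57, 3)
    print(assess(cpu))
```

A low or medium confidence factor does not change the arithmetic, but a verdict based on a seed or extrapolated optimal point deserves less weight than one with high confidence.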

 

Example: Viewing headroom statistics for a node where both the CPU and an aggregate exceed their optimal points

cluster::> set -privilege advanced
cluster::*> statistics start -object resource_headroom_cpu|resource_headroom_aggr
cluster::*> statistics show -object resource_headroom_cpu -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_cpu
Instance: CPU_node_2
Start-time: 6/17/2020 12:31:57
End-time: 6/17/2020 13:31:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1506
                   optimal_point_ops                             1264
                     current_latency                             3761
               optimal_point_latency                             1446
                 current_utilization                               82
           optimal_point_utilization                               57
     optimal_point_confidence_factor                                3

        
cluster::*> statistics show -object resource_headroom_aggr -counter current_ops|current_latency|current_utilization|optimal_point_latency|optimal_point_ops|optimal_point_utilization|optimal_point_confidence_factor

Object: resource_headroom_aggr
Instance: DISK_HDD_node_2_aggr1_fb7a0d4f-9d65-4211-b651-b4cd422ee11d
Start-time: 6/17/2020 12:37:57
End-time: 6/17/2020 13:37:57
Elapsed-time: 3600s
Scope: node_2

    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1488
                   optimal_point_ops                             1156
                     current_latency                            38924
               optimal_point_latency                            28913
                 current_utilization                               67  
           optimal_point_utilization                               52  
     optimal_point_confidence_factor                                3
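
When this output is captured as text (for example, from a saved CLI session), the counter table can be turned into a dictionary and compared programmatically. The following Python sketch assumes the two-column "counter name, integer value" layout shown above; the function names and the regular expression are hypothetical, and the sample rows are copied from the aggregate example.

```python
import re

# A counter row is a lowercase counter name followed by an integer value.
# The "Counter  Value" header and the dashed separator do not match.
COUNTER_LINE = re.compile(r"^\s*([a-z_]+)\s+(-?\d+)\s*$")


def parse_counters(text: str) -> dict[str, int]:
    """Extract counter/value pairs from captured 'statistics show' output."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def metrics_above_optimal(counters: dict[str, int]) -> list[str]:
    """List the metrics whose current value is above the optimal point."""
    return [
        metric
        for metric in ("ops", "latency", "utilization")
        if counters[f"current_{metric}"] > counters[f"optimal_point_{metric}"]
    ]


SAMPLE = """\
    Counter                                                     Value
    -------------------------------- --------------------------------
                         current_ops                             1488
                   optimal_point_ops                             1156
                     current_latency                            38924
               optimal_point_latency                            28913
                 current_utilization                               67
           optimal_point_utilization                               52
     optimal_point_confidence_factor                                3
"""

if __name__ == "__main__":
    parsed = parse_counters(SAMPLE)
    print(metrics_above_optimal(parsed))
```

The "overloaded" determination itself rests on utilization; the ops and latency comparisons are included only as supporting context.
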
  • Resource statistics over longer time frames are available from the Active IQ performance dashboards, which are more useful for capacity planning.
  • The peak_performance metric in the graphs represents the optimal_point_utilization counter from the resource_headroom statistics.


Workload Utilization
  • How much of a given resource each workload consumes can be determined from the workload or QoS statistics
    • QoS statistics provide point-in-time statistics of the resource utilization of workloads, on a per-node basis

Example: volume vol4test is a heavy consumer of both CPU and aggregate resources.

cluster::> qos statistics volume resource cpu show -node node_1
Workload           ID   CPU 
--------------- ----- ----- 
-total- (400%)      -   69% 
vol4test-wid2.. 23350   69% 
-total- (400%)      -   73% 
vol4test-wid2.. 23350   73% 
-total- (400%)      -   73% 
vol4test-wid2.. 23350   73% 
-total- (400%)      -   70% 
vol4test-wid2.. 23350   70% 

cluster::> qos statistics volume resource disk show -node node_1
Workload            ID   Disk Number of HDD Disks   Disk Number of SSD Disks
--------------- ------ ------ ------------------- ------ -------------------
-total-              -    32%                  26     0%                   0
vol4test-wid2..  23350    92%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    96%                   9     0%                   0
-total-              -    33%                  26     0%                   0
vol4test-wid2..  23350    97%                   9     0%                   0
-total-              -    31%                  26     0%                   0
vol4test-wid2..  23350    91%                   9     0%                   0
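
Because the qos statistics commands print a new pair of rows at each interval, averaging the samples per workload can make a heavy consumer easier to spot. The Python sketch below does this for captured CPU output; the function name and the regular expression are hypothetical and assume the three-column layout shown in the CPU example, with the parenthesized figure after -total- skipped.

```python
import re
from collections import defaultdict

# A data row: workload name, optional "(NNN%)" after -total-, an ID or "-",
# then the CPU percentage. Header and separator rows do not match.
ROW = re.compile(r"^(\S+)(?:\s+\(\d+%\))?\s+(\d+|-)\s+(\d+)%\s*$")


def average_cpu_by_workload(text: str) -> dict[str, float]:
    """Average the CPU percentage of every workload across all samples."""
    samples = defaultdict(list)
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            samples[match.group(1)].append(int(match.group(3)))
    return {name: sum(values) / len(values) for name, values in samples.items()}


SAMPLE = """\
Workload           ID   CPU
--------------- ----- -----
-total- (400%)      -   69%
vol4test-wid2.. 23350   69%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   73%
vol4test-wid2.. 23350   73%
-total- (400%)      -   70%
vol4test-wid2.. 23350   70%
"""

if __name__ == "__main__":
    for workload, cpu in average_cpu_by_workload(SAMPLE).items():
        print(f"{workload}: {cpu:.2f}%")
```

In this sample the volume's average equals the node total, which is consistent with vol4test accounting for essentially all of the reported CPU consumption.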

 

Additional Information

  • The nodeshell wafltop command can also help identify which volumes or workloads are the biggest consumers of various resources
  • What is Performance Capacity

 
