NetApp Knowledge Base

Active IQ Wellness: Up to High Impact - This system is nearing the limits of its performance capacity

Category:
ontap-9
Specialty:
core

Applies to

ONTAP 9 

Answer

Value of reviewing this information:
  • Performance capacity or headroom measures how much work you can place on a node or aggregate before the performance of workloads begins to be affected by latency.
  • Being aware of and managing available performance capacity helps ensure you provision and balance workloads to get expected response times. 
How is this wellness check validated?

Current performance capacity can be accessed and viewed by 3 different methods:

  • Follow the steps outlined in ONTAP 9 document: Identifying Remaining Performance Capacity 
    • This method leverages 1 month of heuristic data maintained on the ONTAP system.
  • Active IQ Unified Manager:
    • This method leverages 3 months of data collected by Active IQ Unified Manager in CM_Archive format. 
      1103633-1.png
  • Active IQ
    • (Node - CPU) CPU performance capacity is the difference between the peak_performance and current_utilization counters.
    • (Local Tier - Aggr Util%) Active IQ does not provide a peak value for aggregates, so it cannot be used to show available performance capacity. However, spikes in current utilization can be viewed:
      clipboard_e084420359b07c557e397795620f403f3.png
      • Be aware that systems with no data aggregates, or with backup or disaster recovery roles, may show low performance headroom for aggregate utilization because of a low drive count or periodic high-volume sequential I/O.
      • If increased per-I/O latency is not a concern for the system in question, instances of this risk can be ignored.
      • The risk is validated using AutoSupport Counter Manager data sent to NetApp in Daily Performance Data Notice AutoSupport messages.
        • The data assessed aligns with the ONTAP 9 CLI 1-month calculations.
      • Available performance capacity is compared across all existing NetApp systems to determine the level of impact for this alert:
        • Values greater than the 99.5th percentile (the top 0.5%) result in a High risk.
        • Values from the 99th to the 99.5th percentile result in a Medium risk.
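The headroom arithmetic and percentile banding described above can be sketched in Python. This is a minimal illustration, not NetApp's implementation: the function names, the sample values, and the assumption that the ranked value is used performance capacity (a higher value means less headroom) are all hypothetical. Only the peak-minus-current subtraction and the 99th and 99.5th percentile cut-offs come from this article.

```python
from bisect import bisect_left


def available_capacity(peak_performance: float, current_utilization: float) -> float:
    """Headroom in percentage points; a negative result means utilization is above the peak."""
    return peak_performance - current_utilization


def percentile_rank(value: float, fleet: list[float]) -> float:
    """Percentage of fleet values that are strictly below `value`."""
    if not fleet:
        raise ValueError("fleet must contain at least one value")
    ordered = sorted(fleet)
    return 100.0 * bisect_left(ordered, value) / len(ordered)


def risk_level(value: float, fleet: list[float]) -> str:
    """Map a system's value to a risk band using the cut-offs quoted in the article.

    Assumption: `value` and `fleet` hold used performance capacity,
    so higher values indicate less remaining headroom.
    """
    rank = percentile_rank(value, fleet)
    if rank > 99.5:
        return "High"
    if rank >= 99.0:
        return "Medium"
    return "None"


if __name__ == "__main__":
    # Hypothetical fleet of 1,000 systems with distinct values 0..999.
    fleet = [float(v) for v in range(1000)]
    print(available_capacity(80.0, 65.0))
    print(risk_level(998.0, fleet), risk_level(992.0, fleet), risk_level(500.0, fleet))
```

How ties and the exact boundary values are handled here is a design choice of the sketch; the article does not specify it.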
What should I do about the information provided by this Active IQ Wellness rule?  
  • If you already have a plan for this proactive Active IQ warning, acknowledge it in your Active IQ dashboard.
  • Acknowledging it ensures that the Wellness warnings you still see are issues you do not yet have a plan to address.
  • To address this type of scenario:
  1. Do not increase the workload if the available performance capacity is insufficient to handle it and the current workloads cannot tolerate increased latency.
  2. Monitor workload indicators such as throughput (xbps, IOPS) and utilization, so that you can respond and plan before performance is affected.

A good starting point is the Performance Management guidance, which covers using Active IQ Unified Manager and setting thresholds and alerts.
1103633-3.png

The following counters can be monitored: 
1103633-4.png

1103633-5.png
 

  3. If, while monitoring the selected thresholds, you receive a warning that a capacity threshold has been exceeded, reduce the workload or relocate it to less busy nodes as necessary to maintain expected performance.
  4. Use Unified Manager's Usage Overview panel to identify the top-consuming workloads, and try to ensure that they do not share the same controller.
  5. Use Active IQ to review the difference between current and peak performance (CPU) or spikes in average utilization (aggregate), which correspond to the performance capacity information provided by AutoSupport.
    If current utilization approaches peak performance, or if spikes are seen, review the workloads and, if there is an issue, relocate workloads to less busy nodes.
  6. Review KB: How to rectify performance issues using monitoring tools
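As a rough illustration of the monitoring and relocation steps above, the following Python sketch flags nodes whose used performance capacity exceeds a threshold and suggests the node with the most headroom as a relocation target. The node names, the 85% threshold, and the function names are hypothetical assumptions made for this example; they are not Unified Manager defaults or APIs.

```python
from typing import Optional


def nodes_over_threshold(used_pct: dict[str, float], threshold: float = 85.0) -> list[str]:
    """Return the nodes whose used performance capacity exceeds the threshold, busiest first."""
    over = [node for node, used in used_pct.items() if used > threshold]
    return sorted(over, key=lambda node: used_pct[node], reverse=True)


def relocation_target(used_pct: dict[str, float], exclude: list[str]) -> Optional[str]:
    """Return the least busy node that is not in `exclude`, or None if there is no candidate."""
    candidates = {node: used for node, used in used_pct.items() if node not in exclude}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical used performance capacity per node, in percent.
    sample = {"node-01": 92.0, "node-02": 40.0, "node-03": 88.5, "node-04": 61.0}
    busy = nodes_over_threshold(sample)
    print("Over threshold:", busy)
    print("Suggested target:", relocation_target(sample, exclude=busy))
```

In practice the input values would come from whichever monitoring tool is in use, and the threshold would match the one configured there.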

Additional Information

Where can I find more information on this topic?

ONTAP 9 document: Identifying Remaining Performance Capacity 

 
