NetApp Knowledge Base

ONTAP System Manager web interface is unresponsive while CLI remains available due to degraded RAID

Applies to

  • ONTAP 9 (all currently supported releases)
  • ONTAP System Manager (OSM) — browser-based cluster management UI
  • FAS / AFF / ASA platforms with degraded aggregates or RAID groups

Issue

The ONTAP System Manager (OSM) web interface is intermittently or persistently unresponsive on the cluster-management LIF, while CLI access through SSH and the Service Processor Infrastructure (SPI) continues to work normally.

Observed symptoms:

  • The browser shows the loading spinner indefinitely and never renders the dashboard, or the dashboard appears briefly before the session is logged out and returned to the login page.
  • Navigating between System Manager menus (Storage, Network, Events, etc.) returns to the login screen or hangs.
  • Browser developer tools show pending requests against the cluster-management LIF that are never answered (no HTTP response, eventually time out).
  • Accessing systemshell logs from System Manager or CLI is unusually slow and frequently times out.
  • The same behavior is reproduced from multiple browsers, including a private/InPrivate window, on multiple clients.

Steps that do not resolve the issue (because they do not address the underlying load on the web tier):

  • Moving the cluster-management LIF to a different node or port.
  • Connecting to a node-management LIF instead of the cluster-management LIF.
  • Restarting the web services by disabling and re-enabling external access (system services web modify -external false, then -external true), or restarting the related daemons.
  • Regenerating or replacing the cluster, node, or web SSL certificates.
  • Performing a controlled takeover/giveback (TO/GB) or rolling reboot of the nodes.

Cause

The unresponsive System Manager UI is the visible symptom of two conditions running simultaneously on the cluster:

  1. An aggregate is in a long-standing degraded RAID state (DISK REDUNDANCY FAILED). One or more failed disks have not yet been replaced, or replacement disks are present but RAID group reconstruction has not completed and the spare partitions are not fully owned by the affected RAID group. The aggregate remains at risk, and the management I/O path that backs the web UI competes for the same overloaded subsystem.
  2. External clients are polling the cluster-management LIF at a high frequency. Sources such as monitoring scripts, decommissioned automation, BlueXP / Active IQ collectors, or unmanaged third-party tools issue HTTPS calls every 10 to 15 seconds. With the backend already slowed by the degraded RAID, the httpd / php-fpm processes that serve System Manager have no free workers left and incoming requests queue up, so genuine browser sessions either never complete or are dropped.

SSH and SPI are unaffected because they do not traverse the same web-services stack and do not depend on the management I/O path that the degraded aggregate is throttling.
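
The way slow backend responses and frequent polling combine can be illustrated with a rough capacity estimate based on Little's law (average busy workers = request rate × time per request). The Python sketch below is illustrative only: the function names, the poller count, the response times, and the pool size of 10 are hypothetical values chosen for the example, not figures taken from ONTAP.

```python
def busy_workers(pollers: int, interval_s: float, service_time_s: float) -> float:
    """Average number of workers kept busy by periodic pollers (Little's law)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    arrival_rate = pollers / interval_s  # requests per second from all pollers
    return arrival_rate * service_time_s


def pool_is_saturated(pollers: int, interval_s: float,
                      service_time_s: float, pool_size: int) -> bool:
    """True when pollers alone keep at least pool_size workers busy on average."""
    return busy_workers(pollers, interval_s, service_time_s) >= pool_size


# Hypothetical numbers: 8 clients polling every 10 seconds, pool of 10 workers.
healthy = busy_workers(pollers=8, interval_s=10, service_time_s=0.5)
degraded = busy_workers(pollers=8, interval_s=10, service_time_s=15)
print(f"healthy backend:  {healthy:.1f} workers busy")
print(f"degraded backend: {degraded:.1f} workers busy")
print("saturated when degraded:", pool_is_saturated(8, 10, 15, pool_size=10))
```

With these assumed values, the same polling load that occupies less than one worker on a healthy backend occupies more workers than the pool holds once each request takes 15 seconds, which is consistent with browser sessions being starved while the polling rate itself has not changed.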

Solution

Confirm both conditions and address them in this order. Restoring web responsiveness without first fixing the RAID typically does not hold.

1. Confirm the RAID / aggregate state from CLI:

  • Run storage aggregate show -fields state,raidstatus,is-inconsistent and look for any aggregate whose state is not online or whose RAID status is not normal (a healthy RAID-DP aggregate reports raid_dp, normal). Anything reporting degraded, reconstruct, or partial needs attention.
  • Run storage aggregate show-status (the clustershell equivalent of the nodeshell aggr status -r) and confirm there are no failed or missing entries and that no plex is degraded.
  • Run event log show -message-name raid.rg.recons.*|disk.failmsg|callhome.dsk.redun.fault and confirm there are no active events. Combine the message names into a single |-separated query, and do not add a -severity ALERT filter, because reconstruction and disk-failure messages are logged at lower severities. A standing callhome.dsk.redun.fault indicates the RAID group is still without enough redundancy.
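
When many aggregates are involved, the check in the first bullet can be scripted against saved CLI output. The sketch below is a minimal example, assuming the output has been captured as text with three whitespace-separated columns (aggregate name, state, RAID status) plus the usual header, separator, and "entries were displayed" lines. The function name, the sample text, and the keyword list are assumptions made for illustration. Adjust them to the layout your release actually prints.

```python
from typing import List, Tuple

HEALTHY_STATE = "online"
# Substrings in the RAID status column that indicate lost or rebuilding redundancy.
PROBLEM_WORDS = ("degraded", "reconstruct", "partial", "failed")


def flag_aggregates(cli_output: str) -> List[Tuple[str, str, str]]:
    """Return (aggregate, state, raid_status) for rows that need attention."""
    flagged = []
    for line in cli_output.splitlines():
        stripped = line.strip()
        if not stripped or stripped.endswith("displayed."):
            continue  # blank line or trailing row-count line
        parts = stripped.split(None, 2)
        if len(parts) < 3:
            continue
        if parts[0].lower() == "aggregate" or set(parts[0]) == {"-"}:
            continue  # column header or dashed separator
        name, state, raid = parts[0], parts[1].lower(), parts[2].strip().lower()
        if state != HEALTHY_STATE or any(word in raid for word in PROBLEM_WORDS):
            flagged.append((name, state, raid))
    return flagged


# Hypothetical captured output, used only to exercise the parser.
SAMPLE = """\
aggregate  state  raidstatus
---------- ------ ------------------
aggr0_n1   online raid_dp, normal
aggr1_n1   online raid_dp, degraded
2 entries were displayed.
"""

for row in flag_aggregates(SAMPLE):
    print(row)
```

The parser deliberately treats any unexpected state or status keyword as a finding, so an unfamiliar value is surfaced for review rather than silently passed.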

2. Restore RAID redundancy:

  • If failed disks are present and not yet replaced, open a hardware-replacement case so the disks are dispatched and installed. If a previous hardware case exists, reference it so the prior actions are preserved on the new dispatch.
  • Once new disks are inserted, verify ownership with storage disk show -container-type spare and assign the spares to the correct node with storage disk assign -disk <disk_name> -owner <node_name> if needed.
  • Confirm the RAID group is reconstructing with storage aggregate show -aggregate <aggr_name> -instance (look for Reconstruction Status) and storage aggregate plex show -aggregate <aggr_name>.
  • Wait for reconstruction to finish and the aggregate to return to normal. Do not proceed with the web-tier checks until this completes.

3. Identify and stop the high-frequency external polling:

  • Open a recent AutoSupport (ASUP) bundle and examine mlog/httpd.log and mlog/php_error.log, or run set diag; systemshell -node <node> -command "tail -1000 /mroot/etc/log/mlog/httpd.log". Group the access lines by source IP and count requests per minute.
  • Any source IP issuing a request every 10 to 15 seconds is a candidate. Cross-check against expected collectors (BlueXP Connector, Active IQ Unified Manager, NetApp Harvest/Prometheus, in-house monitoring).
  • Stop the unauthorized or decommissioned clients. If the client cannot be stopped immediately, restrict access at the network layer or with a firewall rule on the cluster-management subnet, or move the cluster-management LIF to a subnet the unwanted client cannot reach.
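
Grouping access lines by source IP and measuring the gap between consecutive requests can be done with a short script once the log has been copied off the cluster. The sketch below assumes the access lines follow the common Apache log layout (client IP first, timestamp in square brackets such as [12/Mar/2024:10:00:10 +0000]). That layout, the function name, and the thresholds are assumptions for illustration. Check a few real lines from your httpd.log and adjust the regular expression and timestamp format if they differ.

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import median
from typing import Dict, Iterable

# Assumed line layout: "<ip> <ident> <user> [<timestamp>] ..."
LINE_RE = re.compile(r"^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\]")
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def polling_candidates(lines: Iterable[str],
                       max_interval_s: float = 15.0,
                       min_requests: int = 5) -> Dict[str, float]:
    """Map source IP -> median seconds between requests, for frequent pollers only."""
    stamps_by_ip = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not an access line in the assumed layout
        try:
            stamp = datetime.strptime(match.group("ts"), TS_FORMAT)
        except ValueError:
            continue  # timestamp in an unexpected format
        stamps_by_ip[match.group("ip")].append(stamp)

    candidates = {}
    for ip, stamps in stamps_by_ip.items():
        if len(stamps) < min_requests:
            continue  # too few requests to call it polling
        stamps.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        typical_gap = median(gaps)
        if typical_gap <= max_interval_s:
            candidates[ip] = typical_gap
    return candidates
```

A typical use is to pass the open log file straight in, for example polling_candidates(open("httpd.log", errors="replace")), and then compare the returned addresses against the list of expected collectors. The median is used instead of the mean so that a single long pause does not hide an otherwise regular 10 to 15 second cadence.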

4. Re-test System Manager:

  • Open System Manager from a private/InPrivate browser window using the cluster-management LIF address.
  • If the dashboard still does not load:
      • Confirm the active certificate on the web role with security ssl show -vserver <cluster>.
      • Verify it is the certificate currently presented by the LIF (use openssl s_client -connect <cluster-mgmt-IP>:443).
      • Restart the web services.
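
As an alternative to reading openssl s_client output by eye, the certificate presented on port 443 can be compared programmatically with a PEM copy of the certificate you expect the web role to serve. The sketch below uses only the Python standard library. The function names are hypothetical, and how the expected PEM is exported from the cluster is left out here. Certificate validation is intentionally not performed, because the aim is only to identify which certificate is being presented, including self-signed ones.

```python
import hashlib
import ssl


def pem_fingerprint(pem_cert: str) -> str:
    """SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert.strip())
    return hashlib.sha256(der).hexdigest()


def presented_fingerprint(host: str, port: int = 443) -> str:
    """Fingerprint of the certificate the server presents (no trust validation)."""
    pem = ssl.get_server_certificate((host, port))
    return pem_fingerprint(pem)


def same_certificate(expected_pem: str, host: str, port: int = 443) -> bool:
    """True when the presented certificate is byte-identical to the expected one."""
    return pem_fingerprint(expected_pem) == presented_fingerprint(host, port)
```

For example, same_certificate(open("expected.pem").read(), "<cluster-mgmt-IP>") returns False if the LIF is still serving an older certificate than the one configured for the web role.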

Note: If the cluster has a previous closed case for the same disk-replacement work and the RAID never returned to normal, treat the new request as a continuation and reference the prior case number so the dispatch team has full history.

Internal Notes

  • Source case: 2010610598 (Admentis, FAS2720, ONTAP 9.14.1P4, cluster CLUSTER-PRA). Issue ran for over two months while only certificate / web-service / LIF remediations were attempted.
  • Diagnostic flow that found root cause: open recent ASUP, scan mlog/httpd.log for high-frequency repeating source IPs, then verify aggregate / RAID state from CLI; the combination of the two pointed at the right fix.
  • Related prior case where disks were dispatched but RAID never returned to normal: 2010602866. When this pattern is suspected, always confirm RAID returned to normal in the previous case before treating the case as just a System Manager / web issue.

NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.