ONTAP System Manager web interface is unresponsive while CLI remains available due to degraded RAID
Applies to
- ONTAP 9 (all currently supported releases)
- ONTAP System Manager (OSM) — browser-based cluster management UI
- FAS / AFF / ASA platforms with degraded aggregates or RAID groups
Issue
ONTAP System Manager (OSM) web interface is intermittently or persistently unresponsive on the cluster-management LIF, while CLI access via SSH and Service Processor Infrastructure (SPI) continues to work normally.
Observed symptoms:
- The browser stays on the loading spinner and never renders the dashboard, or the dashboard appears briefly before the session is logged out and returned to the login page.
- Navigating between System Manager menus (Storage, Network, Events, etc.) returns to the login screen or hangs.
- Browser developer tools show pending requests against the cluster-management LIF that are never answered (no HTTP response, eventually time out).
- Accessing systemshell logs from System Manager or CLI is unusually slow and frequently times out.
- The same behavior is reproduced from multiple browsers, including a private/InPrivate window, on multiple clients.
Steps that do not resolve the issue (because they do not address the underlying load on the web tier):
- Moving the cluster-management LIF to a different node or port.
- Connecting to a node-management LIF instead of the cluster-management LIF.
- Restarting the web services with system services web modify -external true followed by re-enabling, or restarting the related daemons.
- Regenerating or replacing the cluster, node, or web SSL certificates.
- Performing a controlled takeover/giveback (TO/GB) or rolling reboot of the nodes.
Cause
The unresponsive System Manager UI is the visible symptom of two conditions running simultaneously on the cluster:
- An aggregate is in a long-standing degraded RAID state (DISK REDUNDANCY FAILED). One or more failed disks are not yet replaced, or replacement disks are present but the RAID group reconstruction has not completed and the spare partitions are not fully owned by the affected RAID group. The aggregate stays at risk and the management I/O path that backs the web UI competes for the same overloaded subsystem.
- External clients are polling the cluster-management LIF at a high frequency. Sources such as monitoring scripts, decommissioned automation, BlueXP / Active IQ collectors, or unmanaged third-party tools issue HTTPS calls every 10 to 15 seconds. With the backend already slowed by the degraded RAID, each request takes longer to complete, the httpd/php-fpm worker pool that serves System Manager is exhausted, and new requests queue up, so genuine browser sessions either never complete or are dropped.
SSH and SPI are unaffected because they do not traverse the same web-services stack and do not depend on the management I/O path that the degraded aggregate is throttling.
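The interaction between the two conditions can be illustrated with a rough capacity estimate. The sketch below applies Little's law (requests in flight = arrival rate x time per request). The poller count and service times are hypothetical values chosen for illustration, not measurements from any cluster, and the function name is an assumption of this example.

```python
import math

def workers_needed(num_pollers: int, poll_interval_s: float, service_time_s: float) -> int:
    """Estimate how many web workers are busy on average (Little's law)."""
    arrival_rate = num_pollers / poll_interval_s  # requests per second
    return math.ceil(arrival_rate * service_time_s)

# Hypothetical numbers: 20 polling clients, one request every 10 seconds each.
healthy = workers_needed(num_pollers=20, poll_interval_s=10, service_time_s=0.5)
degraded = workers_needed(num_pollers=20, poll_interval_s=10, service_time_s=8.0)
print(healthy, degraded)
```

Under these assumed numbers, the same polling load that occupies a single worker on a healthy backend occupies sixteen once each request takes eight seconds, which is why neither condition alone necessarily produces the symptom.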
Solution
Confirm both conditions and address them in this order. Restoring web responsiveness without first fixing the RAID typically does not hold.
1. Confirm the RAID / aggregate state from CLI:
- Run storage aggregate show -fields state,raid-status,is-inconsistent and look for any aggregate that is not online,normal,raid_dp (or the equivalent RAID type). Anything reporting degraded, reconstructing, partial, or raid_dp,degraded needs attention.
- Run storage aggregate status -r (advanced privilege) and confirm there are no failed or missing entries and that no plex is degraded.
- Run event log show -severity ALERT -message-name raid.rg.recons.* -message-name disk.failmsg -message-name callhome.disk.redun.fail and confirm no active alerts. A standing callhome.disk.redun.fail indicates the RAID group is still without enough redundancy.
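When many aggregates are involved, the check in step 1 can be scripted once the state and RAID status values have been collected. The following is a minimal sketch that assumes the values have already been extracted into (name, state, raid status) tuples; the aggregate names and the tuple layout are hypothetical, and the function does not parse real CLI output.

```python
def needs_attention(state: str, raid_status: str) -> bool:
    """Return True if an aggregate is not online or its RAID status is not normal."""
    flags = {part.strip().lower() for part in raid_status.split(",")}
    bad = {"degraded", "reconstructing", "partial"}
    return (
        state.strip().lower() != "online"
        or "normal" not in flags
        or bool(flags & bad)
    )

# Hypothetical rows: (aggregate name, state, raid status).
rows = [
    ("aggr0_n1", "online", "raid_dp, normal"),
    ("aggr1_n1", "online", "raid_dp, degraded"),
]
flagged = [name for name, state, raid in rows if needs_attention(state, raid)]
print(flagged)
```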
2. Restore RAID redundancy:
- If failed disks are present and not yet replaced, open a hardware-replacement case so the disks are dispatched and installed. If a previous hardware case exists, reference it so the prior actions are preserved on the new dispatch.
- Once new disks are inserted, verify ownership with storage disk show -container-type spare and assign the spares to the correct node with storage disk assign -disk <disk_name> -owner <node_name> if needed.
- Confirm the RAID group is reconstructing with storage aggregate show -instance <aggr_name> (look for Reconstruction Status) and storage aggregate plex show -aggregate <aggr_name>.
- Wait for reconstruction to finish and the aggregate to return to normal. Do not proceed with the web-tier checks until this completes.
3. Identify and stop the high-frequency external polling:
- Open a recent AutoSupport (ASUP) bundle and examine mlog/httpd.log and mlog/php_error.log, or run set diag; systemshell -node <node> -command "tail -1000 /mroot/etc/log/mlog/httpd.log". Group the access lines by source IP and count requests per minute.
- Any source IP issuing a request every 10 to 15 seconds is a candidate. Cross-check against expected collectors (BlueXP Connector, Active IQ Unified Manager, NetApp Harvest/Prometheus, in-house monitoring).
- Stop the unauthorized or decommissioned clients. If the client cannot be stopped immediately, restrict access at the network layer or with a firewall rule on the cluster-management subnet, or move the cluster-management LIF to a subnet the unwanted client cannot reach.
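The grouping described in step 3 can be done with a short script once the log has been copied off the cluster. This is a sketch under a stated assumption: each access line starts with the source IP and carries an Apache-style timestamp in square brackets. The actual httpd.log layout may differ, in which case LINE_RE and TS_FORMAT need adjusting. The function name, thresholds, sample addresses, and request paths are all illustrative.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line layout: "<ip> ... [10/Oct/2024:13:55:36 +0000] ..." (adjust if the log differs).
LINE_RE = re.compile(r"^(?P<ip>\S+) .*?\[(?P<ts>[^\]]+)\]")
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # month abbreviations assume an English/C locale

def polling_candidates(lines, max_interval_s=15.0, min_requests=5):
    """Map each source IP to its mean request interval if it polls at or below the threshold."""
    times = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        try:
            stamp = datetime.strptime(match.group("ts"), TS_FORMAT)
        except ValueError:
            continue  # skip unparseable timestamps
        times[match.group("ip")].append(stamp)

    result = {}
    for ip, stamps in times.items():
        if len(stamps) < min_requests:
            continue  # too few requests to call it polling
        stamps.sort()
        span = (stamps[-1] - stamps[0]).total_seconds()
        mean_interval = span / (len(stamps) - 1)
        if mean_interval <= max_interval_s:
            result[ip] = round(mean_interval, 1)
    return result

# Illustrative sample: one client every 10 seconds, one occasional browser user.
sample = [
    f'192.0.2.10 - - [10/Oct/2024:13:55:{s:02d} +0000] "GET /api/cluster HTTP/1.1" 200 512'
    for s in range(0, 60, 10)
] + [
    '192.0.2.20 - admin [10/Oct/2024:13:50:00 +0000] "GET /sysmgr/ HTTP/1.1" 200 1024',
    '192.0.2.20 - admin [10/Oct/2024:13:55:00 +0000] "GET /sysmgr/ HTTP/1.1" 200 1024',
]
print(polling_candidates(sample))
```

The mean interval is used rather than a per-minute count so that the result lines up directly with the 10 to 15 second threshold. Any IP it reports still has to be cross-checked against the list of expected collectors before it is blocked.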
4. Re-test System Manager:
- Open System Manager from a private/InPrivate browser window using the cluster-management LIF address.
- If the dashboard still does not load:
  - Confirm the active certificate on the web role with security ssl show -vserver <cluster>.
  - Verify it is the certificate currently presented by the LIF (use openssl s_client -connect <cluster-mgmt-IP>:443).
  - Restart the web services.
Note: If the cluster has a previous closed case for the same disk-replacement work and the RAID never returned to normal, treat the new request as a continuation and reference the prior case number so the dispatch team has full history.
Partner Notes
Internal Notes
- Source case: 2010610598 (Admentis, FAS2720, ONTAP 9.14.1P4, cluster CLUSTER-PRA). Issue ran for over two months while only certificate / web-service / LIF remediations were attempted.
- Diagnostic flow that found root cause: open recent ASUP, scan mlog/httpd.log for high-frequency repeating source IPs, then verify aggregate / RAID state from CLI; the combination of the two pointed at the right fix.
- Related prior case where disks were dispatched but RAID never returned to normal: 2010602866. When this pattern is suspected, always confirm RAID returned to normal in the previous case before treating the case as just a System Manager / web issue.
