ActiveIQ Unified Manager 9.13+ becomes unresponsive due to high CPU utilization
Applies to
- ActiveIQ Unified Manager (AIQUM) 9.13 and later
- RHEL/OVA
Issue
- AIQUM 9.13+ becomes unresponsive intermittently
- Rebooting the server resolves the issue for a few days before the server hangs again
- Sudden high CPU utilization is observed when the issue occurs
ps -eo pcpu,pid,user,args | sort -k 1 -nr | head -10
shows a Java process owned by the jboss user consuming more than 100% CPU:
- Jobs associated with AIQUM, such as acquisition jobs or Commvault aux copy jobs, fail.
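The CPU check described above can be narrowed to just the offending processes. The following is a minimal sketch, not a NetApp-provided tool: the function name `high_cpu_procs` and the default threshold of 100 are assumptions chosen for illustration.

```shell
#!/bin/sh
# high_cpu_procs THRESHOLD
# Reads `ps -eo pcpu,pid,user,args` output on stdin and prints
# "PID USER %CPU" for every process above THRESHOLD percent CPU.
# The function name and the default threshold of 100 are illustrative.
high_cpu_procs() {
    awk -v t="${1:-100}" '
        # Skip the header row and any line whose first field is not numeric.
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ && $1 + 0 > t + 0 { print $2, $3, $1 }
    '
}

# Example invocation on the AIQUM server:
#   ps -eo pcpu,pid,user,args | high_cpu_procs 100
# Per-thread view of one process (replace <PID> with a PID printed above):
#   top -H -p <PID>
```

If the function prints a line for a Java process owned by jboss, that matches the symptom described in this article.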
ocumserver.log
has a NullPointerException (NPE) recorded every hour for the scale DB pool:
ERROR [oncommand] [task-scheduler-10] [c.n.s.s.a.DbPoolScaleMonitor] Exception occurred while detecting scale db pool issue:
java.lang.NullPointerException: null
at deployment.dfm-app.war//com.netapp.dfm.common.metrics.MetricsRegistryProvider.getGauge(MetricsRegistryProvider.java:168)
at deployment.dfm-app.war//com.netapp.scalemonitor.service.automation.ScaleMonitorUtils.getDbConnectionData(ScaleMonitorUtils.java:71)
at deployment.dfm-app.war//com.netapp.scalemonitor.service.automation.DbPoolScaleMonitor.detectProblemsAndProvideRecommendation(DbPoolScaleMonitor.java:90)
....
....
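To confirm that the error recurs, the matching lines in the log can be counted. This is a minimal sketch: the function name `count_dbpool_npe` is hypothetical, the search string is copied from the excerpt above, and the log path in the usage note is an assumption that may differ between RHEL and OVA deployments.

```shell
#!/bin/sh
# count_dbpool_npe LOGFILE
# Prints how many times the DbPoolScaleMonitor error line appears in LOGFILE.
# The function name is illustrative; the search string is copied from the
# log excerpt above.
count_dbpool_npe() {
    grep -c 'Exception occurred while detecting scale db pool issue' "$1"
}
```

For example, `count_dbpool_npe /var/log/ocum/ocumserver.log` (path assumed) prints a single number. A count that grows by roughly one per hour of log coverage is consistent with the hourly pattern described here.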
journalctl.log
has multiple "Rate limit exceeded" entries from one or more sources:
ocum kernel: Rate limit exceeded: IN=eth0 OUT= MAC=<YY:YY:YY:YY:YY:YY:YY:YY:YY:YY:YY:YY:YY> SRC=XX.XX.XX.XX DST=<AIQUM_IP> LEN=52 TOS=0x02 PREC=0x00 TTL=126 ID=9520 DF PROTO=TCP SPT=52199 DPT=443 WINDOW=8192 RES=0x00 CWR ECE SYN URGP=0
ocum kernel: Rate limit exceeded: IN=eth0 OUT= MAC=<YY:YY:YY:YY:YY:YY:YY:YY:YY:YY:YY:YY:YY> SRC=XX.XX.XX.XX DST=<AIQUM_IP> LEN=52 TOS=0x02 PREC=0x00 TTL=126 ID=9591 DF PROTO=TCP SPT=52206 DPT=443 WINDOW=8192 RES=0x00 CWR ECE SYN URGP=0
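To see which sources trigger the rate limit most often, the `SRC=` values in these entries can be tallied. This is a minimal sketch that assumes the entry format shown above; the function name `rate_limit_sources` is hypothetical.

```shell
#!/bin/sh
# rate_limit_sources
# Reads kernel log lines on stdin and prints "COUNT SOURCE" for every SRC=
# value found on a "Rate limit exceeded" line, busiest source first.
# The function name is illustrative.
rate_limit_sources() {
    awk '
        /Rate limit exceeded/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^SRC=/) count[substr($i, 5)]++
        }
        END { for (src in count) print count[src], src }
    ' | sort -rn
}
```

It can be fed from a collected log file, for example `rate_limit_sources < journalctl.log`, or from the live kernel journal with `journalctl -k | rate_limit_sources`.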