AIQUM detects Cluster Monitoring is stuck and fails to gather information from all nodes
Applies to
- Active IQ Unified Manager (AIQUM) 9.13P1
- ONTAP 9
Issue
- AIQUM detects a Cluster Monitoring is stuck event:
Cluster Monitoring is stuck
Monitoring is stuck for the cluster <CLUSTER>. Reason: Monitoring completion time exceeded. Monitoring StartTime: <TIMESTAMP>, Monitoring EndTime: <TIMESTAMP>. Contact AIQUM technical support.
- AIQUM fails to gather information from all ONTAP clusters
- ocumserver.log indicates a timeout during data acquisition:
ERROR [oncommand] [reconciliation-1] [<CLUSTER>(incremental@13:50:42.775)|complete] [c.n.d.c.CollectionCompletionNotifier] Timeout occurred while waiting on collection completion listener ClusterSparesEventDetector..EnhancerBySpringCGLIB..a7e633af. Cancelling it so that others can continue.
INFO [oncommand] [collection-completion-sync-8] [c.n.d.c.r.SnapmirrorTableCompletionListener] SnapmirrorRelationship Table is updated after collection completion for cluster <CLUSTER>
INFO [oncommand] [reconciliation-1] [<CLUSTER>(incremental@13:50:42.775)|complete] [c.n.d.c.CollectionCompletionNotifier] Collection completion notification for cluster <CLUSTER>(inventory.ontap.fas.Cluster:1910729) finished in 0:15:00.625.
INFO [oncommand] [reconciliation-1] [<CLUSTER>(incremental@13:50:42.775)] [com.netapp.dfm.collector.LockUtils] Releasing reconciliation-processing lock(java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@32cddd9e[Locked by thread reconciliation-1]) for 51
ERROR [oncommand] [reconciliation-1] [<CLUSTER>(incremental@13:50:42.775)] [c.n.dfm.collector.OcieJmsListener] Inventory change listener error
java.util.concurrent.TimeoutException: null
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
at deployment.dfm-app.war//com.netapp.dfm.collector.CollectionCompletionNotifier.notifyListeners(CollectionCompletionNotifier.java:320)
at deployment.dfm-app.war//com.netapp.dfm.collector.OcieJmsListener.reconcileAndNotify(OcieJmsListener.java:1322)
at deployment.dfm-app.war//com.netapp.dfm.collector.OcieJmsListener.reconcileDataSourceChanges(OcieJmsListener.java:1009)
at deployment.dfm-app.war//com.netapp.dfm.collector.OcieJmsListener.handleChangeMessage(OcieJmsListener.java:976)
at deployment.dfm-app.war//com.netapp.dfm.collector.OcieJmsListener$2.run(OcieJmsListener.java:483)
at deployment.dfm-app.war//com.netapp.dfm.common.metrics.executor.ThreadPoolMonitorExecutor.lambda$submit$1(ThreadPoolMonitorExecutor.java:179)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at deployment.dfm-app.war//com.netapp.dfm.common.metrics.executor.ThreadPoolMonitorExecutor.lambda$execute$0(ThreadPoolMonitorExecutor.java:165)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
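To check how often this pattern occurs, the log excerpt above can be summarized with a short script. The following is a minimal sketch, not a NetApp-provided tool: the function name `summarize_collection_log`, the 10-minute `slow_after` threshold, and the regular expressions are assumptions derived only from the two message formats shown in the excerpt.

```python
import re
from collections import Counter
from datetime import timedelta

# Patterns are assumptions based on the message text in the excerpt above.
TIMEOUT_RE = re.compile(
    r"Timeout occurred while waiting on collection completion listener (\S+)"
)
DURATION_RE = re.compile(r"finished in (\d+):(\d{2}):(\d{2})\.(\d{3})")


def summarize_collection_log(lines, slow_after=timedelta(minutes=10)):
    """Count listener timeouts and collect slow completion durations.

    lines      -- any iterable of log lines (for example an open file)
    slow_after -- hypothetical threshold; not an AIQUM setting
    """
    timeouts = Counter()
    slow = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            # Keep only the listener class name, dropping the proxy suffix
            # that follows the first ".." as printed in the excerpt.
            name = match.group(1).split("..")[0].rstrip(".")
            timeouts[name] += 1
        match = DURATION_RE.search(line)
        if match:
            hours, minutes, seconds, millis = map(int, match.groups())
            duration = timedelta(
                hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis
            )
            if duration >= slow_after:
                slow.append(duration)
    return timeouts, slow
```

For example, `summarize_collection_log(open(path_to_log))` on a copy of ocumserver.log returns the per-listener timeout counts and the list of completion notifications that took at least `slow_after`.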
- journalctl reports many Rate limit exceeded messages:
<AIQUM_HOST> kernel: Rate limit exceeded: IN=eth0 OUT= MAC=<MAC_ADDRESS> SRC=<IP> DST=<AIQUM_IP> LEN=52 TOS=0x02 PREC=0x00 TTL=124 ID=32617 DF PROTO=TCP SPT=64078 DPT=443 WINDOW=8192 RES=0x00 CWR ECE SYN URGP=0
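To see which clients trigger the most Rate limit exceeded messages, the kernel log lines can be grouped by source address and destination port. This is a rough sketch under stated assumptions: the function name `count_rate_limited` is hypothetical, and the regular expression relies only on the `SRC=` and `DPT=` fields visible in the sample line above.

```python
import re
from collections import Counter

# Assumes the field layout of the sample kernel message shown above.
RATE_LIMIT_RE = re.compile(
    r"Rate limit exceeded: .*?\bSRC=(\S+).*?\bDPT=(\d+)"
)


def count_rate_limited(lines):
    """Return a Counter keyed by (source address, destination port)."""
    counts = Counter()
    for line in lines:
        match = RATE_LIMIT_RE.search(line)
        if match:
            source, port = match.group(1), int(match.group(2))
            counts[(source, port)] += 1
    return counts
```

Given the saved text of the kernel journal (for example the output of `journalctl -k` written to a file), `count_rate_limited(open(path)).most_common(5)` lists the five busiest source and port pairs.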