Active IQ Unified Manager database corruption caused by disk running out of space
Applies to
- Active IQ Unified Manager (AIQUM) 9.6+
- RHEL or CentOS
Issue
- Unable to view any historical or current performance data for any cluster in the AIQUM GUI
- The MySQL error log reports the following messages:
[ERROR] [MY-000035] [Server] Disk is full writing './unified-manager.004948' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs.
2023-07-25T06:46:41.283486Z 1271 [ERROR] [MY-010907] [Server] Error writing file 'unified-manager' (errno: 28 - No space left on device)
2023-07-25T06:46:41.284609Z 1271 [ERROR] [MY-011072] [Server] Binary logging not possible. Message: An error occurred during flush stage of the commit. 'binlog_error_action' is set to 'ABORT_SERVER'. Server is being stopped..
2023-07-25T06:46:41Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=64a1d52e8c241c89abf59dc7d461f945ce41974c
Thread pointer: 0x7f66d61e2000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
[Warning] [MY-012351] [InnoDB] Tablespace 1097, name 'netapp_performance/sample_qos_volume_workload_null#p#p0', file './netapp_performance/sample_qos_volume_workload_null#p#p0.ibd' is missing!
[Warning] [MY-012351] [InnoDB] Tablespace 1098, name 'netapp_performance/summary_qos_volume_workload_null', file './netapp_performance/summary_qos_volume_workload_null.ibd' is missing!
[Warning] [MY-012351] [InnoDB] Tablespace 3032, name 'netapp_performance/sample_fcpport#p#p13', file './netapp_performance/sample_fcpport#p#p13.ibd' is missing!
[ERROR] [MY-012592] [InnoDB] Operating system error number 2 in a file operation.
[ERROR] [MY-012593] [InnoDB] The error means the system cannot find the path specified.
[ERROR] [MY-012216] [InnoDB] Cannot open datafile for read-only: './netapp_performance/sample_cluster#p#p14.ibd' OS error: 71
- Ocumserver.log and ServerMega.log report that many performance tables are missing as a result of the database corruption:
ERROR [oncommand] [opmTaskExecutor-1] [c.n.i.s.a.dao.NodeInventoryDao] fetchNodeInventoryStats for Nodes query: SELECT configSet.objid, configSet.elementName, configSet.elementResourceKey, configSet.numberOfHours, COALESCE(SUM(avgLatency*totalOps)/SUM(totalOps),SUM(avgLatency*totalOps)) latency, AVG(totalOps) ops, AVG(opmSysThroughput) throughput, AVG(opmSysReadThroughput) readThroughput, AVG(opmSysUtilization) AS nodeUtilization, configSet.freeCapacity, configSet.totalCapacity, configSet.clusterId, configSet.clusterName, configSet.clusterResourceKey, configSet.thresholdPolicyId, configSet.thresholdPolicyName, configSet.clusterFqdn, configSet.cacheReadThroughput, configSet.usedHeadroom, configSet.availableOps, 0 AS eventSeverity FROM ((SELECT n.objId AS objid, n.name elementName, n.resourceKey elementResourceKey, 72 numberOfHours, (n.aggregateBytesTotal-n.aggregateBytesUsed) freeCapacity, n.aggregateBytesTotal totalCapacity, c.objId clusterId, c.name clusteRName, c.resourceKey clusterResourceKey, GROUP_CONCAT(DISTINCT tp.id ORDER BY tp.id) thresholdPolicyId, GROUP_CONCAT(DISTINCT tp.name ORDER BY tpm.policyId) thresholdPolicyName, cluster.fqdn AS clusterFqdn , AVG(ext.cacheReadThroughput) AS cacheReadThroughput , AVG(cHRoomUsedPercent) usedHeadroom , AVG(availableOps) availableOps FROM netapp_model_view.node n JOIN netapp_model_view.cluster c on c.objId=n.clusterId LEFT JOIN ocum.cluster cluster ON (c.objid = cluster.id) LEFT JOIN opm.threshold_policy_mapping tpm ON (tpm.objectId=n.objId AND tpm.endTime is null) LEFT JOIN opm.threshold_policy tp ON (tp.id=tpm.policyId AND tp.elementType=9) LEFT JOIN netapp_performance.summary_extcacheobj ext ON (n.objId = ext.objId AND ext.fromtime >= 1690050600000 AND ext.fromtime < 1690309864758) LEFT JOIN netapp_performance.summary_opm_headroom_cpu sohc ON (n.objId = sohc.objId AND sohc.fromtime > 1690050600000 AND sohc.fromtime < 1690309864758) GROUP BY n.objId) configSet LEFT JOIN netapp_performance.summary_node sn ON 
(configSet.objId = sn.objId AND fromtime >= 1690050600000 AND fromtime < 1690309864758) ) GROUP BY configSet.objId failed.
org.springframework.dao.TransientDataAccessResourceException: StatementCallback; SQL [ SELECT configSet.objid, configSet.elementName, configSet.elementResourceKey, configSet.numberOfHours, COALESCE(SUM(avgLatency*totalOps)/SUM(totalOps),SUM(avgLatency*totalOps)) latency, AVG(totalOps) ops, AVG(opmSysThroughput) throughput, AVG(opmSysReadThroughput) readThroughput, AVG(opmSysUtilization) AS nodeUtilization, configSet.freeCapacity, configSet.totalCapacity, configSet.clusterId, configSet.clusterName, configSet.clusterResourceKey, configSet.thresholdPolicyId, configSet.thresholdPolicyName, configSet.clusterFqdn, configSet.cacheReadThroughput, configSet.usedHeadroom, configSet.availableOps, 0 AS eventSeverity FROM ((SELECT n.objId AS objid, n.name elementName, n.resourceKey elementResourceKey, 72 numberOfHours, (n.aggregateBytesTotal-n.aggregateBytesUsed) freeCapacity, n.aggregateBytesTotal totalCapacity, c.objId clusterId, c.name clusteRName, c.resourceKey clusterResourceKey, GROUP_CONCAT(DISTINCT tp.id ORDER BY tp.id) thresholdPolicyId, GROUP_CONCAT(DISTINCT tp.name ORDER BY tpm.policyId) thresholdPolicyName, cluster.fqdn AS clusterFqdn , AVG(ext.cacheReadThroughput) AS cacheReadThroughput , AVG(cHRoomUsedPercent) usedHeadroom , AVG(availableOps) availableOps FROM netapp_model_view.node n JOIN netapp_model_view.cluster c on c.objId=n.clusterId LEFT JOIN ocum.cluster cluster ON (c.objid = cluster.id) LEFT JOIN opm.threshold_policy_mapping tpm ON (tpm.objectId=n.objId AND tpm.endTime is null) LEFT JOIN opm.threshold_policy tp ON (tp.id=tpm.policyId AND tp.elementType=9) LEFT JOIN netapp_performance.summary_extcacheobj ext ON (n.objId = ext.objId AND ext.fromtime >= 1690050600000 AND ext.fromtime < 1690309864758) LEFT JOIN netapp_performance.summary_opm_headroom_cpu sohc ON (n.objId = sohc.objId AND sohc.fromtime > 1690050600000 AND sohc.fromtime < 1690309864758) GROUP BY n.objId) configSet LEFT JOIN netapp_performance.summary_node sn ON (configSet.objId = sn.objId AND 
fromtime >= 1690050600000 AND fromtime < 1690309864758) ) GROUP BY configSet.objId]; (conn=868) Tablespace is missing for table `netapp_performance`.`summary_extcacheobj`.; nested exception is java.sql.SQLTransientConnectionException: (conn=868) Tablespace is missing for table `netapp_performance`.`summary_extcacheobj`.
ERROR [default task-59] c.n.o.p.f.p.t.SampleTable (SampleTable.java:373) - Failed to execute prepared statement for com.netapp.oci.platform.framework.performance.tables.PartitionedSampleTable@6fcaf188: java.sql.SQLSyntaxErrorException: (conn=71) Table 'netapp_performance.sample_qos_service_center_27699' doesn't exist
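The signatures above can be checked for mechanically. The following is a minimal sketch, not a NetApp-provided tool: it counts "No space left on device" (errno 28) messages and collects the tables named in InnoDB MY-012351 "is missing!" warnings. The function name summarize and the stripping of the "#p#" partition suffix are illustrative assumptions, and the log location is not stated in this article, so the caller has to supply the lines.

```python
import re

# Patterns derived from the log excerpts in this article.
# Matches both "OS errno 28 - ..." and "errno: 28 - ...".
DISK_FULL = re.compile(r"errno:? 28 - No space left on device")
# Captures the tablespace name from MY-012351 warnings.
MISSING_TABLESPACE = re.compile(r"\[MY-012351\].*?name '([^']+)'.*is missing!")


def summarize(lines):
    """Return (disk_full_count, sorted list of tables with missing tablespaces).

    Assumption: partitioned tablespaces are named '<schema>/<table>#p#<partition>',
    as in the excerpts above, so everything from '#p#' on is dropped to report
    each table only once.
    """
    disk_full = 0
    tables = set()
    for line in lines:
        if DISK_FULL.search(line):
            disk_full += 1
        match = MISSING_TABLESPACE.search(line)
        if match:
            tables.add(match.group(1).split("#p#")[0])
    return disk_full, sorted(tables)
```

To use it, open the MySQL error log in text mode and pass the file object to summarize(). A non-zero first value together with a non-empty table list matches the pattern described in this article.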