StorageGRID reports CassandraRepairProgressSlow due to intermittent network issue
Applies to
- NetApp StorageGRID 11.7
- StorageGRID deployments with an imbalance of nodes across sites.
Issue
- StorageGRID UI reports the CassandraRepairProgressSlow alert.
- Support > Metrics > Cassandra Network Overview > Reaper Repair Percentage shows long periods where the repair is not running, after which it starts again.
- spreaper list-runs output lists:
"creation_time": "2024-12-30T21:13:21Z",
"current_time": "2025-01-16T17:10:12Z",
"datacenters": [],
"duration": "16 days 19 hours 56 minutes 50 seconds",
"end_time": null,
"estimated_time_of_arrival": "2025-03-15T07:52:03Z",
"id": "xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2",
"incremental_repair": false,
"intensity": 1.0,
"keyspace_name": "reaper_db",
"last_event": "Postponed a segment because no coordinator was reachable",
"nodes": [],
"owner": "storagegrid",
"pause_time": null,
"repair_parallelism": "PARALLEL",
"repair_thread_count": 4,
"repair_unit_id": "XXXXX-ba4f-11eb-ba15-f975a0a43552",
"segments_repaired": 3187,
"start_time": "2024-12-30T21:13:22Z",
"state": "RUNNING",
"total_segments": 14096
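To put the numbers in this output in perspective, the following is a minimal Python sketch that derives the repair percentage and a naive completion estimate from the fields shown above. The function names (parse_ts, repair_progress) are hypothetical, and the linear extrapolation is an assumption made for illustration, not a statement of how Reaper computes estimated_time_of_arrival.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse a timestamp in the format used by the list-runs output."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def repair_progress(run):
    """Return (percent complete, elapsed time, naive ETA) for one repair run."""
    done = run["segments_repaired"]
    total = run["total_segments"]
    now = parse_ts(run["current_time"])
    elapsed = now - parse_ts(run["start_time"])
    percent = 100.0 * done / total
    eta = None
    if done > 0:
        # Assumption: the remaining segments take as long per segment
        # as the segments repaired so far (simple linear extrapolation).
        eta = now + elapsed * ((total - done) / done)
    return percent, elapsed, eta


# Values copied from the list-runs output in this article.
run = {
    "current_time": "2025-01-16T17:10:12Z",
    "start_time": "2024-12-30T21:13:22Z",
    "segments_repaired": 3187,
    "total_segments": 14096,
}

percent, elapsed, eta = repair_progress(run)
print(f"{percent:.1f}% of segments repaired after {elapsed.days} days")
print(f"naive ETA: {eta.date().isoformat()}")
```

With these values, fewer than a quarter of the segments are repaired after more than 16 days, which is consistent with the slow-progress alert.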
- reaper.log (cassandra-x.x.x.x/reaper/reaper.log) from the StorageGRID logs of a node at each site reports:
- Node at Site 1:
INFO[StorageGRID-name1] 2025-01-16 17:21:08,651 i.c.s.CassandraStorage - Retry for 100th time, rethrowing
WARN[storagegrid:] 2025-01-16 17:21:08,652 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during CAS write query at consistency SERIAL (4 replica were required but only 3 acknowledged the write)
INFO[storagegrid:] 2025-01-16 17:21:08,850 i.c.s.RepairRunner - Postponed a segment because no coordinator was reachable
INFO[storagegrid:] 2025-01-16 17:21:08,851 i.c.s.SegmentRunner - Postponing segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2
- Node at Site 2:
WARN[storagegrid:] 2025-01-16 17:22:28,901 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during CAS write query at consistency SERIAL (4 replica were required but only 0 acknowledged the write)
INFO[storagegrid:] 2025-01-16 17:22:39,269 i.c.s.CassandraStorage - Trying to release lead on segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2 for instance xxxxxxxx-xxxx-xxxx-xxxx-97819177c25f
ERROR[storagegrid:] 2025-01-16 17:22:39,269 i.c.s.CassandraStorage - Could not release lead on segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2
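When comparing reaper.log across sites, it can help to extract how many replicas acknowledged each CAS write. The following is a minimal Python sketch, assuming the log lines have already been read into a list of strings. The names ACK_PATTERN and ack_shortfalls are hypothetical, and the sample lines are abbreviated from the excerpts above.

```python
import re

# Matches the replica counts in the WriteTimeoutException message.
# "\s*" tolerates a missing space between "only" and the number.
ACK_PATTERN = re.compile(
    r"\((\d+) replica were required but only\s*(\d+) acknowledged the write\)"
)


def ack_shortfalls(lines):
    """Return (required, acknowledged) pairs for each CAS write timeout found."""
    results = []
    for line in lines:
        match = ACK_PATTERN.search(line)
        if match:
            results.append((int(match.group(1)), int(match.group(2))))
    return results


# Abbreviated sample lines based on the Site 1 and Site 2 excerpts.
sample = [
    "WriteTimeoutException: Cassandra timeout during CAS write query at "
    "consistency SERIAL (4 replica were required but only 3 acknowledged the write)",
    "WriteTimeoutException: Cassandra timeout during CAS write query at "
    "consistency SERIAL (4 replica were required but only 0 acknowledged the write)",
    "i.c.s.RepairRunner - Postponed a segment because no coordinator was reachable",
]

shortfalls = ack_shortfalls(sample)
for required, acknowledged in shortfalls:
    print(f"required={required} acknowledged={acknowledged}")
```

In the excerpts above, the Site 1 node received 3 of 4 acknowledgements while the Site 2 node received none.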
- Node at Site 1: