NetApp Knowledge Base

StorageGRID reports CassandraRepairProgressSlow due to intermittent network issue

Applies to

  • NetApp StorageGRID 11.7
  • StorageGRID deployments with an imbalance of nodes across sites

Issue

  • The StorageGRID UI reports the CassandraRepairProgressSlow alert.
  • Support > Metrics > Cassandra Network Overview > Reaper Repair Percentage shows long periods where the repair stops running and then starts again.
  • spreaper list-runs output lists:

    "creation_time": "2024-12-30T21:13:21Z",
    "current_time": "2025-01-16T17:10:12Z",
    "datacenters": [],
    "duration": "16 days 19 hours 56 minutes 50 seconds",
    "end_time": null,
    "estimated_time_of_arrival": "2025-03-15T07:52:03Z",
    "id": "xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2",
    "incremental_repair": false,
    "intensity": 1.0,
    "keyspace_name": "reaper_db",
    "last_event": "Postponed a segment because no coordinator was reachable",
    "nodes": [],
    "owner": "storagegrid",
    "pause_time": null,
    "repair_parallelism": "PARALLEL",
    "repair_thread_count": 4,
    "repair_unit_id": "XXXXX-ba4f-11eb-ba15-f975a0a43552",
    "segments_repaired": 3187,
    "start_time": "2024-12-30T21:13:22Z",
    "state": "RUNNING",
    "total_segments": 14096

  • reaper.log (cassandra-x.x.x.x/reaper/reaper.log) in the StorageGRID logs from a node at each site reports:
    • Node at Site 1:
      • INFO [StorageGRID-name1] 2025-01-16 17:21:08,651 i.c.s.CassandraStorage - Retry for 100th time, rethrowing
        WARN [storagegrid:] 2025-01-16 17:21:08,652 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2
        com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during CAS write query at consistency SERIAL (4 replica were required but only 3 acknowledged the write)
        INFO [storagegrid:] 2025-01-16 17:21:08,850 i.c.s.RepairRunner - Postponed a segment because no coordinator was reachable
        INFO [storagegrid:] 2025-01-16 17:21:08,851 i.c.s.SegmentRunner - Postponing segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2
    • Node at Site 2:
      • WARN [storagegrid:] 2025-01-16 17:22:28,901 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2
        com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during CAS write query at consistency SERIAL (4 replica were required but only 0 acknowledged the write)
        INFO [storagegrid:] 2025-01-16 17:22:39,269 i.c.s.CassandraStorage - Trying to release lead on segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2 for instance xxxxxxxx-xxxx-xxxx-xxxx-97819177c25f
        ERROR [storagegrid:] 2025-01-16 17:22:39,269 i.c.s.CassandraStorage - Could not release lead on segment xxxxxxx-xxxxx-xxxxx-xxxx-13c2ca0716a2
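
The symptoms above can be cross-checked with a short script. What follows is a minimal sketch in Python, not a NetApp tool: the helper names repair_progress and summarize_reaper_log are hypothetical, and it assumes that one run from the spreaper list-runs output has already been loaded into a dictionary and that reaper.log lines are available as plain strings. The field names and message texts are taken from the excerpts in this article.

```python
import re
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
POSTPONED = "Postponed a segment because no coordinator was reachable"
# Matches the CAS write timeout detail; tolerates a missing space before the count.
CAS_TIMEOUT = re.compile(
    r"\((\d+) replica were required but only\s*(\d+) acknowledged the write\)"
)


def repair_progress(run):
    """Return percent complete, elapsed days, and average segments per day."""
    start = datetime.strptime(run["start_time"], TIME_FORMAT)
    now = datetime.strptime(run["current_time"], TIME_FORMAT)
    elapsed_days = (now - start).total_seconds() / 86400
    done = run["segments_repaired"]
    total = run["total_segments"]
    return {
        "percent_done": 100.0 * done / total,
        "elapsed_days": elapsed_days,
        "segments_per_day": done / elapsed_days if elapsed_days > 0 else 0.0,
    }


def summarize_reaper_log(lines):
    """Count postponed segments and collect (required, acknowledged) pairs."""
    postponed = 0
    cas_timeouts = []
    for line in lines:
        if POSTPONED in line:
            postponed += 1
        match = CAS_TIMEOUT.search(line)
        if match:
            cas_timeouts.append((int(match.group(1)), int(match.group(2))))
    return {"postponed": postponed, "cas_timeouts": cas_timeouts}


# Values copied from the run summary in this article.
run = {
    "start_time": "2024-12-30T21:13:22Z",
    "current_time": "2025-01-16T17:10:12Z",
    "segments_repaired": 3187,
    "total_segments": 14096,
}
print(repair_progress(run))
```

With the values shown in this article, the run has completed a little under a quarter of its segments after nearly 17 days. In the log summary, pairs such as (4, 3) and (4, 0) show how many replicas acknowledged each CAS write, which helps compare the sites against each other.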
