NetApp Knowledge Base

Cassandra repair progress slow alert and frequent cassandra-reaper service restarts on StorageGRID 11.4

Applies to

  • NetApp StorageGRID 11.4 (pre-11.4.0.3)
  • New StorageGRID deployment
  • NetApp StorageGRID environment upgraded from 11.3 (pre-11.3.0.11 hotfix)

Issue

  • After a new deployment of StorageGRID 11.4, or an upgrade to 11.4 from a pre-11.3.0.11 release (for example, 11.3.0.10 or any other 11.3 build), users may receive the Cassandra repair progress slow alert in the StorageGRID GUI.
  • The Cassandra repair progress slow alert may result from many causes, including service unavailability and communication issues.
  • To confirm that the issue matches this article, check for the following additional signatures:
  1. The Cassandra repair progress slow alert has persisted for over 2 days with the effective repair percentage at 0%.
  2. The cassandra-reaper service, which is responsible for Cassandra repair operations, is restarting frequently on various storage nodes.

This can be confirmed via the /var/local/log/servermanager.log file on the storage node(s):

| cassandra-reaper          | restart initiated
| cassandra-reaper          | cassandra-reaper ended
| reaper                    | starting reaper
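
As a rough way to quantify how often the service restarts, the excerpt above can be scanned for its "restart initiated" entries. The following is a minimal sketch in Python. It assumes each log line contains the pipe-delimited service name and message shown above (any timestamp or prefix before them is ignored). The function name count_reaper_restarts is hypothetical and is not part of StorageGRID.

```python
import re

# Matches the "restart initiated" entry for the cassandra-reaper service,
# in the pipe-delimited form shown in the excerpt above (assumed layout).
RESTART_RE = re.compile(r"\|\s*cassandra-reaper\s*\|\s*restart initiated")


def count_reaper_restarts(log_text: str) -> int:
    """Count cassandra-reaper 'restart initiated' entries in log text."""
    return sum(1 for line in log_text.splitlines() if RESTART_RE.search(line))
```

To use it, read the contents of /var/local/log/servermanager.log into a string and pass it to count_reaper_restarts. A count that keeps growing over a short period is consistent with the frequent restarts described in this signature.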

  3. The Cassandra reaper log, /var/local/log/cassandra-reaper.log on the storage node or reaper.log in a lumberjack collection, contains an exception for failure to achieve consistency level QUORUM or EACH_QUORUM:

WARN [storagegrid:615635d0-342b-11eb-b6cc-4bacd6a2d5fe:615c9e91-342b-11eb-b6cc-4bacd6a2d5fe] 2020-12-08 18:57:38,140 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment 615c9e91-342b-11eb-b6cc-4bacd6a2d5fe 

com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency EACH_QUORUM (2 required but only 0 alive)
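
The exception line above carries three useful values: the consistency level, the number of replicas required, and the number alive. The sketch below extracts them in Python. It assumes the message text matches the form shown above exactly. The names Unavailable and parse_unavailable are hypothetical helpers for illustration only.

```python
import re
from typing import NamedTuple, Optional


class Unavailable(NamedTuple):
    consistency: str  # for example QUORUM or EACH_QUORUM
    required: int     # replicas required by the consistency level
    alive: int        # replicas reported alive


# Pattern for the UnavailableException message in the form shown above.
UNAVAILABLE_RE = re.compile(
    r"UnavailableException: Not enough replicas available for query at "
    r"consistency (\w+) \((\d+) required but only (\d+) alive\)"
)


def parse_unavailable(line: str) -> Optional[Unavailable]:
    """Return the parsed values, or None if the line is not such an exception."""
    match = UNAVAILABLE_RE.search(line)
    if match is None:
        return None
    return Unavailable(match.group(1), int(match.group(2)), int(match.group(3)))
```

Applied to each line of the reaper log, this makes it easy to see whether the failures are at QUORUM or EACH_QUORUM and how far the alive count falls short of the required count.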

  4. The Cassandra reaper repair list, taken from reaper_commands.txt in the lumberjack collection of the storage node(s) or obtained by running the command spreaper --reaper-host=localhost --reaper-port=9403 status-cluster storagegrid in an SSH session to a storage node, indicates that the repairs for some or all keyspaces show the following message for the last event:

      "creation_time": "2020-11-24T23:05:08Z", 
      "current_time": "2020-12-08T18:59:39Z", 
      "datacenters": [], 
      "duration": "7 days 0 hours 2 minutes 13 seconds", 
      "end_time": "2020-12-01T23:07:22Z", 
      "estimated_time_of_arrival": null, 
      "id": "7f8d00b0-2ea9-11eb-b76b-d7a5b22a5393", 
      "incremental_repair": false, 
      "intensity": 1.000, 
      "keyspace_name": "storagegrid", 
      "last_event": "Postponed a segment because no coordinator was reachable"
      "nodes": [], 
      "owner": "auto-scheduling", 
      "pause_time": null, 
      "repair_parallelism": "PARALLEL", 
      "repair_thread_count": 4, 
      "repair_unit_id": "dc8dbfa0-17c7-11eb-b890-676ddd59fc8a", 
      "segments_repaired": 0, 
      "start_time": "2020-11-24T23:05:08Z", 
      "state": "ABORTED", 

      "creation_time": "2020-11-17T20:50:58Z", 
      "current_time": "2020-12-08T18:59:40Z", 
      "datacenters": [], 
      "duration": "7 days 0 hours 0 minutes 32 seconds", 
      "end_time": "2020-11-24T20:51:31Z", 
      "estimated_time_of_arrival": null, 
      "id": "9882a450-2916-11eb-8180-07cae1e33f50", 
      "incremental_repair": false, 
      "intensity": 1.000, 
      "keyspace_name": "reaper_db", 
      "last_event": "Postponed a segment because no coordinator was reachable"
      "nodes": [], 
      "owner": "auto-scheduling", 
      "pause_time": null, 
      "repair_parallelism": "PARALLEL", 
      "repair_thread_count": 4, 
      "repair_unit_id": "dc818aa0-17c7-11eb-b890-676ddd59fc8a", 
      "segments_repaired": 0, 
      "start_time": "2020-11-17T20:50:59Z", 
      "state": "ABORTED", 

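When the repair list is long, the affected keyspaces can be picked out programmatically. The sketch below is in Python and assumes the repair runs have already been loaded into a list of dictionaries with the field names shown in the excerpts above. How the command output is converted into that list is left to the reader, because the excerpts above are fragments rather than complete documents. The function name stalled_keyspaces is hypothetical.

```python
from typing import Any, Dict, Iterable, List

# The last_event message that identifies this issue, as shown above.
NO_COORDINATOR = "Postponed a segment because no coordinator was reachable"


def stalled_keyspaces(runs: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the keyspace names of repair runs whose last event matches."""
    return [
        run.get("keyspace_name", "<unknown>")
        for run in runs
        if run.get("last_event") == NO_COORDINATOR
    ]
```

For the two runs excerpted above, this would report the storagegrid and reaper_db keyspaces. The check deliberately looks only at last_event, since that is the field this signature refers to. The state and segments_repaired fields can be inspected separately for context.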
 

 

 


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.