Cassandra repair progress slow alert and frequent cassandra-reaper service restarts on StorageGRID 11.4
Applies to
- NetApp StorageGRID 11.4 (pre-11.4.0.3)
- New StorageGRID deployment
- NetApp StorageGRID environment upgraded from 11.3 (pre-11.3.0.11 hotfix)
Issue
- After a new deployment of StorageGRID 11.4, or an upgrade from a pre-11.3.0.11 release (for example, 11.3.0.10 or any other 11.3 build) to 11.4, users may receive the following alert in the StorageGRID GUI: Cassandra repair progress slow
- The Cassandra repair progress slow alert may be the result of many issues, including service unavailability and communication issues.
- To confirm that the issue matches this article, there are a few additional signatures that can be checked; example commands for checking each signature from an SSH session are included after the repair list output at the end of this section:
- The Cassandra repair progress slow alert has persisted for over 2 days with the effective repair percentage at 0%.
- The cassandra-reaper service, which is responsible for Cassandra repair operations, is restarting frequently on various storage nodes. This can be confirmed via the /var/local/log/servermanager.log file on the storage node(s):
| cassandra-reaper | restart initiated
| cassandra-reaper | cassandra-reaper ended
| reaper | starting reaper
- The Cassandra reaper log under /var/local/log/cassandra-reaper.log, or reaper.log in the lumberjack collection, contains an exception for failure to achieve consistency level QUORUM or EACH_QUORUM:
WARN [storagegrid:615635d0-342b-11eb-b6cc-4bacd6a2d5fe:615c9e91-342b-11eb-b6cc-4bacd6a2d5fe] 2020-12-08 18:57:38,140 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment 615c9e91-342b-11eb-b6cc-4bacd6a2d5fe
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency EACH_QUORUM (2 required but only 0 alive)
- The Cassandra reaper repair list, from reaper_commands.txt in the lumberjack collection of the storage node(s) or obtained by running the command spreaper --reaper-host=localhost --reaper-port=9403 status-cluster storagegrid in an SSH session to a storage node, indicates that some or all keyspaces' repairs contain the following message for the last event:
"creation_time": "2020-11-24T23:05:08Z",
"current_time": "2020-12-08T18:59:39Z",
"datacenters": [],
"duration": "7 days 0 hours 2 minutes 13 seconds",
"end_time": "2020-12-01T23:07:22Z",
"estimated_time_of_arrival": null,
"id": "7f8d00b0-2ea9-11eb-b76b-d7a5b22a5393",
"incremental_repair": false,
"intensity": 1.000,
"keyspace_name": "storagegrid",
"last_event": "Postponed a segment because no coordinator was reachable",
"nodes": [],
"owner": "auto-scheduling",
"pause_time": null,
"repair_parallelism": "PARALLEL",
"repair_thread_count": 4,
"repair_unit_id": "dc8dbfa0-17c7-11eb-b890-676ddd59fc8a",
"segments_repaired": 0,
"start_time": "2020-11-24T23:05:08Z",
"state": "ABORTED",
"creation_time": "2020-11-17T20:50:58Z",
"current_time": "2020-12-08T18:59:40Z",
"datacenters": [],
"duration": "7 days 0 hours 0 minutes 32 seconds",
"end_time": "2020-11-24T20:51:31Z",
"estimated_time_of_arrival": null,
"id": "9882a450-2916-11eb-8180-07cae1e33f50",
"incremental_repair": false,
"intensity": 1.000,
"keyspace_name": "reaper_db",
"last_event": "Postponed a segment because no coordinator was reachable",
"nodes": [],
"owner": "auto-scheduling",
"pause_time": null,
"repair_parallelism": "PARALLEL",
"repair_thread_count": 4,
"repair_unit_id": "dc818aa0-17c7-11eb-b890-676ddd59fc8a",
"segments_repaired": 0,
"start_time": "2020-11-17T20:50:59Z",
"state": "ABORTED",