StorageGRID expansion fails with error: 'Starting Cassandra. Error: Failed to start. Retrying'
Applies to
- StorageGRID Webscale 11.1
- StorageGRID Webscale 11.0
Issue
When expanding a set of nodes, the Grid Management Interface (GMI) shows the non-storage nodes as 'Complete', while the storage nodes show one of the following statuses:
'Waiting for Cassandra nodes to join the cluster'
'Starting Cassandra. Error: Failed to start. Retrying'
'Waiting to Start Services'
The Cassandra log file on the affected node, /var/local/log/cassandra/system.log, contains the following error:
ERROR [main] 2018-08-31 10:40:14,250 CassandraDaemon.java (line 678) Exception encountered during startup
java.lang.RuntimeException: A node with address localhost-grid/<IP_Address> already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:559) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:889) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:666) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:614) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:354) [cassandra-all-3.0.15.162564.jar:3.0.15.162564]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:582) [cassandra-all-3.0.15.162564.jar:3.0.15.162564]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:665) [cassandra-all-3.0.15.162564.jar:3.0.15.162564]
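To confirm that a node is hitting this particular failure, the log can be searched for the collision message rather than read by eye. The following is a minimal sketch, assuming the log path given above; the function name has_endpoint_collision is hypothetical and not part of StorageGRID.

```shell
#!/bin/sh
# Hypothetical helper (not a StorageGRID tool): succeeds if the given
# Cassandra log contains the endpoint-collision message quoted above.
has_endpoint_collision() {
    # Default to the log location named in this article.
    log_file="${1:-/var/local/log/cassandra/system.log}"
    # -q: print nothing; the exit status is 0 only if the text is found.
    grep -q 'already exists, cancelling join' "$log_file"
}
```

For example, `has_endpoint_collision && echo "endpoint collision found"` run as root on the storage node checks the default log file.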
Cause
- When expanding a set of nodes, storage nodes need to join as a group at the Cassandra service level.
- If one storage node encounters errors in Cassandra, the other storage nodes wait for that node to catch up. The whole expansion process therefore pauses until the errors on that node are resolved.
- In contrast, non-storage nodes, such as admin nodes or gateway nodes, have no such dependency. Their expansion can complete without waiting for other nodes.
Solution
Perform the following steps on the storage node that displays the error 'Starting Cassandra. Error: Failed to start. Retrying':
- SSH to the node and escalate to root privilege by running:
su -
- Back up the existing Cassandra environment file, /etc/cassandra/cassandra-env.sh, by running:
cp /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh_bk
- Add the following two lines at the end of the Cassandra environment file:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<SG_Node_IP>"
JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"
- Start the Cassandra service:
service cassandra start
- SSH to a storage node that displays 'Waiting for Cassandra nodes to join the cluster'.
- Confirm its Cassandra service has been started successfully:
service cassandra status
- Confirm node status at the Cassandra level in the cluster:
nodetool status
- If the node that was originally in error shows 'UN' (Up/Normal), continue with the remaining steps on that node.
- Confirm that its Cassandra service has started successfully, using the same service cassandra status command.
- Remove or comment out the two JVM_OPTS lines that were added to /etc/cassandra/cassandra-env.sh.
- Resume the interrupted expansion process on that storage node:
touch /tmp/unhalt
The whole expansion process resumes in the GMI after a short time.
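The file edits and the 'UN' check in the steps above can be sketched as shell functions. This is only an illustration under stated assumptions: the function names add_replace_opts, remove_replace_opts and node_is_un are hypothetical, the file path and IP address are supplied by the caller, and node_is_un assumes the usual nodetool status layout in which each node line starts with the two-letter state followed by the address.

```shell
#!/bin/sh
# Hypothetical helpers, not part of StorageGRID.

# Back up the environment file, then append the two temporary JVM options.
# $1 = environment file, $2 = IP address to use for replace_address
add_replace_opts() {
    env_file="$1"
    node_ip="$2"
    cp "$env_file" "${env_file}_bk" || return 1
    {
        # Single quotes keep $JVM_OPTS literal in the written line.
        printf 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=%s"\n' "$node_ip"
        printf 'JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"\n'
    } >> "$env_file"
}

# Remove every line that sets either temporary option.
# Note: this also drops such lines if they existed before the change.
remove_replace_opts() {
    env_file="$1"
    tmp_file="${env_file}.tmp.$$"
    # grep exits 1 when no lines remain; only a status above 1 is an error.
    grep -v -F -e '-Dcassandra.replace_address=' \
               -e '-Dcassandra.allow_unsafe_replace=' "$env_file" > "$tmp_file" \
        || [ $? -eq 1 ] || { rm -f "$tmp_file"; return 1; }
    # Write back in place so ownership and permissions are preserved.
    cat "$tmp_file" > "$env_file" && rm -f "$tmp_file"
}

# Read `nodetool status` output on stdin; succeed if the given IP is UN.
node_is_un() {
    awk -v ip="$1" '$1 == "UN" && $2 == ip { found = 1 } END { exit !found }'
}
```

A possible sequence on the node in error would be `add_replace_opts /etc/cassandra/cassandra-env.sh <SG_Node_IP>`, then `service cassandra start`, and later `nodetool status | node_is_un <SG_Node_IP> && remove_replace_opts /etc/cassandra/cassandra-env.sh`. Removing the lines rather than commenting them out is a simplification; the backup copy remains available as cassandra-env.sh_bk.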