NetApp Knowledge Base

StorageGRID expansion fails with error: 'Starting Cassandra. Error: Failed to start. Retrying'

Applies to

  • StorageGRID Webscale 11.1
  • StorageGRID Webscale 11.0

Issue

When expanding a set of nodes, non-storage nodes show 'Complete' in the Grid Management Interface (GMI), while the storage nodes show one of the following statuses:


'Waiting for Cassandra nodes to join the cluster'
'Starting Cassandra. Error: Failed to start. Retrying'
'Waiting to Start Services'


On the node that fails to start Cassandra, the log file /var/local/log/cassandra/system.log contains the following error:

ERROR [main] 2018-08-31 10:40:14,250 CassandraDaemon.java (line 678) Exception encountered during startup
java.lang.RuntimeException: A node with address localhost-grid/<IP_Address> already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:559) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:889) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:666) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:614) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:354) [cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:582) [cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:665) [cassandra-all-3.0.15.162564.jar:3.0.15.162564]
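
To confirm that a node is hitting this specific error rather than a different startup failure, the log can be searched for the message text. The following is a minimal sketch: the function name has_endpoint_collision is hypothetical, and it assumes the message wording matches the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the given Cassandra log contains
# the endpoint-collision message quoted in this article.
has_endpoint_collision() {
    log_file="$1"
    grep -q "already exists, cancelling join" "$log_file"
}

# Example, using the log path given in this article:
# has_endpoint_collision /var/local/log/cassandra/system.log && echo "endpoint collision found"
```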

Cause

  • When expanding a set of nodes, storage nodes need to join as a group at the Cassandra service level.
  • If one storage node experiences errors in Cassandra, the other storage nodes wait for that node to catch up. As a result, the whole expansion process pauses until the errors on that node are resolved.
  • In contrast, non-storage nodes, such as admin nodes or gateway nodes, have no such dependency. Their expansion can complete without waiting for other nodes.

Solution

Perform the following steps on the storage node that displays the error: 'Starting Cassandra. Error: Failed to start. Retrying':

  1. SSH to the node and escalate to root privileges by running: su -
  2. Back up the existing Cassandra environment file /etc/cassandra/cassandra-env.sh by running: cp /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh_bk
  3. Add the following two lines at the end of the Cassandra environment file:
    1. JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<SG_Node_IP>"
    2. JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"
  4. Start the Cassandra service: service cassandra start
  5. SSH to a node that displays 'Waiting for Cassandra nodes to join the cluster':
    1. Confirm that its Cassandra service has started successfully: service cassandra status
    2. Confirm the node status at the Cassandra level in the cluster: nodetool status
    3. If the node originally in error (the node from step 1) shows 'UN' (Up and Normal), proceed to the next step on that node
  6. On the node originally in error, remove or comment out the two lines added in step 3
  7. Resume the interrupted expansion process on the storage node in error: touch /tmp/unhalt
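
Steps 2, 3, and 6 edit the same file by hand, so it is easy to leave the temporary options in place by mistake. The following sketch shows one way to script the edit and its removal. The function names and the marker comment are assumptions for illustration only and are not part of StorageGRID; the file path and the two JVM options are the ones given in the steps above.

```shell
#!/bin/sh
# Marker appended to each temporary line so it can be found again later.
# The marker text is an assumption, not something Cassandra requires.
MARKER="# temporary: replace_address workaround"

# Steps 2 and 3: back up the environment file, then append the two options.
add_replace_opts() {
    env_file="$1"
    node_ip="$2"
    # Keep an existing backup rather than overwriting it on a second run.
    [ -e "${env_file}_bk" ] || cp "$env_file" "${env_file}_bk"
    {
        printf 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=%s" %s\n' "$node_ip" "$MARKER"
        printf 'JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true" %s\n' "$MARKER"
    } >> "$env_file"
}

# Step 6: remove every line that carries the marker.
remove_replace_opts() {
    env_file="$1"
    tmp_file="${env_file}.tmp"
    grep -vF "$MARKER" "$env_file" > "$tmp_file"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp_file" > "$env_file"
    rm -f "$tmp_file"
}

# Example (run as root on the node in error):
# add_replace_opts /etc/cassandra/cassandra-env.sh <SG_Node_IP>
# ... start Cassandra and verify the node status, then:
# remove_replace_opts /etc/cassandra/cassandra-env.sh
```

Tagging the appended lines with a marker means the cleanup does not depend on remembering the exact option text.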

The expansion process resumes in the GMI after a short time.
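
The 'UN' check in step 5 can also be scripted instead of read by eye. The following is a minimal sketch: the function name node_is_un is hypothetical, and it assumes the usual nodetool status layout in which the first column is the two-letter status/state code and the second column is the node address.

```shell
#!/bin/sh
# Hypothetical helper: reads `nodetool status` output on stdin and succeeds
# only if the line for the given address reports UN (Up and Normal).
node_is_un() {
    node_ip="$1"
    awk -v ip="$node_ip" '
        $2 == ip { found = 1; ok = ($1 == "UN") }
        END      { exit !(found && ok) }
    '
}

# Example (run on a node where Cassandra is already up):
# nodetool status | node_is_un <SG_Node_IP> && echo "node is Up and Normal"
```

The function fails both when the address is missing from the output and when it is listed in any state other than UN, so a success result is safe to treat as the go-ahead for step 6.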

 
