NetApp Knowledge Base

StorageGRID expansion fails with error: 'Starting Cassandra. Error: Failed to start. Retrying'

Applies to

  • StorageGRID Webscale 11.1
  • StorageGRID Webscale 11.0

Issue

When expanding a grid with a set of nodes, the Grid Management Interface (GMI) shows the non-storage nodes as 'Complete', while the storage nodes display one of the following statuses:


'Waiting for Cassandra nodes to join the cluster'
'Starting Cassandra. Error: Failed to start. Retrying'
'Waiting to Start Services'


On the storage node that reports the failure, the Cassandra log file /var/local/log/cassandra/system.log shows the following error:

ERROR [main] 2018-08-31 10:40:14,250 CassandraDaemon.java (line 678) Exception encountered during startup
java.lang.RuntimeException: A node with address localhost-grid/<IP_Address> already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:559) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:889) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:666) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:614) ~[cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:354) [cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:582) [cassandra-all-3.0.15.162564.jar:3.0.15.162564]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:665) [cassandra-all-3.0.15.162564.jar:3.0.15.162564]
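To confirm that a node is hitting this specific error rather than some other startup failure, the log can be searched for the message text. The following is a minimal sketch; the function name has_endpoint_collision is hypothetical, and the default path is the log location given above.

```shell
# Hypothetical helper: succeeds (exit status 0) if the given Cassandra log
# contains the endpoint-collision message quoted in this article.
has_endpoint_collision() {
  log_file=${1:-/var/local/log/cassandra/system.log}
  grep -q 'already exists, cancelling join' "$log_file"
}

# Example use on the affected node:
#   has_endpoint_collision && echo "endpoint collision found in system.log"
```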

Cause

  • When a set of nodes is expanded, the storage nodes must join as a group at the Cassandra service level.
  • If one storage node encounters errors in Cassandra, the other storage nodes wait for that node to catch up. The whole expansion therefore pauses until the errors on that node are resolved.
  • In contrast, non-storage nodes, such as admin nodes or gateway nodes, have no such dependency. Their expansion can complete without waiting for other nodes.
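Because the storage nodes wait on one another, it helps to identify which node has not reached the normal state. The sketch below filters the output of nodetool status (the command used in the Solution section) for one address and prints its two-letter state, where the first letter is U (up) or D (down) and the second is N (normal), L (leaving), J (joining), or M (moving). The function name cassandra_state_for is hypothetical, and the address in the usage comment is a placeholder.

```shell
# Hypothetical helper: read `nodetool status` output on stdin and print the
# two-letter state (for example UN or DN) reported for the given address.
# Prints nothing if the address is not listed.
cassandra_state_for() {
  awk -v ip="$1" '$2 == ip && $1 ~ /^[UD][NLJM]$/ { print $1 }'
}

# Example use on a storage node (placeholder address):
#   nodetool status | cassandra_state_for 10.0.0.12
```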

Solution

Perform the following steps on the storage node that displays the error: 'Starting Cassandra. Error: Failed to start. Retrying':

  1. SSH to the node and escalate to root privilege by running: su -
  2. Back up the existing Cassandra environment file /etc/cassandra/cassandra-env.sh by running: cp /etc/cassandra/cassandra-env.sh /etc/cassandra/cassandra-env.sh_bk
  3. Add the following two lines at the end of the Cassandra environment file:
    1. JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<SG_Node_IP>"
    2. JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"
  4. Start the Cassandra service: service cassandra start
  5. SSH to a node that displays 'Waiting for Cassandra nodes to join the cluster':
    1. Confirm that its Cassandra service has started successfully: service cassandra status
    2. Confirm the node status at the Cassandra level in the cluster: nodetool status
    3. If the node that was originally in error (the node from step 1) shows 'UN' (up and normal), continue with step 6 on that node
  6. Remove or comment out the two lines added in step 3
  7. Resume the expansion process that was interrupted on the storage node in error: touch /tmp/unhalt

The expansion process resumes in the GMI after a short time.
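The file edits in the procedure (backing up the environment file, appending the two JVM_OPTS lines, and removing them again afterwards) can be sketched as two shell functions. This is only an illustration under stated assumptions: the function names add_replace_opts and remove_replace_opts are hypothetical, the node address is passed in as a parameter in place of <SG_Node_IP>, and the service commands and touch /tmp/unhalt are still run by hand as described in the steps.

```shell
# Hypothetical helper: back up the environment file, then append the two
# replace-address options from the procedure.
# Arguments: path to cassandra-env.sh, address of the node in error.
add_replace_opts() {
  env_file=$1
  node_ip=$2
  cp -p "$env_file" "${env_file}_bk" || return 1
  {
    printf 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=%s"\n' "$node_ip"
    printf 'JVM_OPTS="$JVM_OPTS -Dcassandra.allow_unsafe_replace=true"\n'
  } >> "$env_file"
}

# Hypothetical helper: delete the two options again once the node shows 'UN'.
# The file is rewritten in place so its ownership and mode are preserved.
remove_replace_opts() {
  env_file=$1
  tmp_file=$(mktemp) || return 1
  sed -e '/-Dcassandra\.replace_address=/d' \
      -e '/-Dcassandra\.allow_unsafe_replace=/d' \
      "$env_file" > "$tmp_file" && cat "$tmp_file" > "$env_file"
  status=$?
  rm -f "$tmp_file"
  return "$status"
}

# Example use as root on the node in error (placeholder address):
#   add_replace_opts /etc/cassandra/cassandra-env.sh 10.0.0.12
#   service cassandra start
#   ... verify 'UN' with nodetool status from another storage node ...
#   remove_replace_opts /etc/cassandra/cassandra-env.sh
#   touch /tmp/unhalt
```

Deleting the lines rather than restoring the _bk copy keeps any unrelated changes made to the file in the meantime; the backup remains available if a full rollback is needed.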

