Site expansion stuck on Cassandra rebuild
Applies to
Issue
Site expansion is stuck because the expansion process picked a sub-optimal source site for the Cassandra nodetool rebuild.

Cassandra system.log shows:
 INFO [RMI TCP Connection(22710)-127.0.0.1] 2024-03-07 05:30:08,752 RangeStreamer.java (line 127) Rebuild: range (7543867250329265734,7544102375703298946] exists on /<IP_3H> for keyspace accounts
 INFO [RMI TCP Connection(22710)-127.0.0.1] 2024-03-07 05:30:08,753 RangeStreamer.java (line 127) Rebuild: range (7543867250329265734,7544102375703298946] exists on /<IP_3I> for keyspace accounts
 WARN [RMI TCP Connection(22710)-127.0.0.1] 2024-03-07 05:30:08,753 StorageService.java (line 1503) Parameter error while rebuilding node
java.lang.IllegalStateException: Unable to find sufficient sources for streaming range (-6636921090683170249,-6636701783084431689] in keyspace accounts
at org.apache.cassandra.dht.RangeStreamer.handleSourceNotFound(RangeStreamer.java:306)
at org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:285)
at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:129)
at org.apache.cassandra.service.StorageService.rebuild(StorageService.java:1429)
at org.apache.cassandra.service.StorageService.rebuild(StorageService.java:1343)
at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:72)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:276)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1468)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:829)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
at sun.rmi.transport.Transport$1.run(Transport.java:200)
at sun.rmi.transport.Transport$1.run(Transport.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
 INFO [RMI TCP Connection(22716)-127.0.0.1] 2024-03-07 05:30:20,425 StorageService.java (line 1402) starting rebuild for (All keyspaces), (All tokens), RESET_NO_SNAPSHOT,  included DCs: group20
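In a large system.log the two lines that matter in the excerpt above (the "Unable to find sufficient sources" exception and the "starting rebuild" line naming the source datacenter) are easy to miss. The following is a minimal sketch, not a supported tool, of how they could be pulled out. The function name summarize_rebuild is hypothetical, the patterns are modelled only on the lines quoted above, and the handling of several comma-separated datacenters is an assumption.

```python
import re

# Patterns modelled on the log lines quoted in this article.
# Other Cassandra versions may word these messages differently.
NO_SOURCE = re.compile(
    r"Unable to find sufficient sources for streaming range "
    r"\((?P<start>-?\d+),(?P<end>-?\d+)\] in keyspace (?P<keyspace>\S+)"
)
REBUILD_START = re.compile(r"starting rebuild for .*included DCs: (?P<dcs>.+?)\s*$")


def summarize_rebuild(lines):
    """Return (keyspaces that failed to find sources, source DCs named in rebuild starts)."""
    failed_keyspaces = set()
    source_dcs = set()
    for line in lines:
        match = NO_SOURCE.search(line)
        if match:
            failed_keyspaces.add(match.group("keyspace"))
        match = REBUILD_START.search(line)
        if match:
            # Assumption: several datacenters would be comma-separated.
            source_dcs.update(dc.strip() for dc in match.group("dcs").split(","))
    return failed_keyspaces, source_dcs
```

To use it, pass the lines of a log file, for example summarize_rebuild(open(path)) where path points at the node's system.log. If the source datacenter it reports is one that nodetool status shows as down from the expansion site, that matches the situation described in this article.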
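The nodetool status command, run from the expansion site (site 4 / group40), shows the nodes in site 2 (group20) as "DS" (Down/Stopped).
However, the nodes in site 2 (group20) are up and running and are accessible from the other sites.
The rebuild in the log above uses group20 as its source ("included DCs: group20"), which is consistent with the "Unable to find sufficient sources" error.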
Datacenter: group10
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address       Load       Tokens       Owns (effective)  Host ID    Rack
UN  <IP_1A>       1.77 TiB   256          50.9%             <UUID_1A>    unknown
UN  <IP_1B>       1.52 TiB   256          49.0%             <UUID_1B>    unknown
UN  <IP_1C>       1.48 TiB   256          49.2%             <UUID_1C>    unknown
UN  <IP_1D>       1.68 TiB   256          49.5%             <UUID_1D>    unknown
UN  <IP_1E>       1.77 TiB   256          51.4%             <UUID_1E>    unknown
UN  <IP_1F>       1.56 TiB   256          49.9%             <UUID_1F>    unknown
Datacenter: group20
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address       Load       Tokens       Owns (effective)  Host ID    Rack
DS  <IP_2A>       3.06 TiB   256          100.0%            <UUID_2A>    unknown
DS  <IP_2B>       3.13 TiB   256          100.0%            <UUID_2B>    unknown
DS  <IP_2C>       2.99 TiB   256          100.0%            <UUID_2C>    unknown
Datacenter: group30
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address       Load       Tokens       Owns (effective)  Host ID    Rack
UN  <IP_3A>       1.18 TiB   256          33.4%             <UUID_3A>    unknown
UN  <IP_3B>       1.03 TiB   256          33.6%             <UUID_3B>    unknown
UN  <IP_3C>       1.1 TiB    256          34.8%             <UUID_3C>    unknown
UN  <IP_3D>       1.03 TiB   256          31.4%             <UUID_3D>    unknown
UN  <IP_3E>       1.02 TiB   256          34.1%             <UUID_3E>    unknown
UN  <IP_3F>      964.94 GiB  256          32.1%             <UUID_3F>    unknown
UN  <IP_3G>       1.02 TiB   256          34.7%             <UUID_3G>    unknown
UN  <IP_3H>      969.99 GiB  256          31.6%             <UUID_3H>    unknown
UN  <IP_3I>       1.1 TiB    256          34.3%             <UUID_3I>    unknown
Datacenter: group40
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address       Load       Tokens       Owns (effective)  Host ID    Rack
UN  <IP_4A>       10.35 GiB  256          37.1%             <UUID_4A>    unknown
UN  <IP_4B>       7.62 GiB   256          35.7%             <UUID_4B>    unknown
UN  <IP_4C>       4.83 GiB   256          39.9%             <UUID_4C>    unknown
UN  <IP_4D>       11.75 GiB  256          40.1%             <UUID_4D>    unknown
UN  <IP_4E>       10.69 GiB  256          38.8%             <UUID_4E>    unknown
UN  <IP_4F>       6.12 GiB   256          35.2%             <UUID_4F>    unknown
UN  <IP_4G>       8.06 GiB   256          37.0%             <UUID_4G>    unknown
UN  <IP_4H>       14.86 GiB  256          36.0%             <UUID_4H>    unknown
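
To compare how different sites see the cluster, the nodetool status output from each site can be reduced to the nodes that are not "UN". The following is a minimal sketch that assumes the output layout shown above. The function name nodes_not_up_normal is hypothetical, and capturing the output (for example by running nodetool status on a node in each site) is left to the operator.

```python
import re
from collections import defaultdict

# A node row starts with a two-letter code:
# status (U/D) followed by state (N/L/J/M/S), as in the legend lines of the output.
NODE_ROW = re.compile(r"^([UD][NLJMS])\s+(\S+)")


def nodes_not_up_normal(status_text):
    """Map datacenter name -> list of (code, address) for nodes not reported as 'UN'."""
    problems = defaultdict(list)
    datacenter = None
    for raw_line in status_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Datacenter:"):
            datacenter = line.split(":", 1)[1].strip()
            continue
        match = NODE_ROW.match(line)
        if match and match.group(1) != "UN":
            problems[datacenter].append((match.group(1), match.group(2)))
    return dict(problems)
```

Run against the output above, this would list the three group20 nodes with code "DS". Running the same check on output captured from another site, where those nodes are reachable, would be expected to return nothing for group20, which confirms that the problem is the expansion site's view of group20 rather than the group20 nodes themselves.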
