StorageGRID EC rebalance shows state as failed but all moves completed
Applies to
Issue
- The EC rebalance status reports the state as Failure even though all planned moves completed and none failed:
root@<primaryadminnode>:~ # rebalance-data status --job-id XXXXX
==============================================================================
Job ID : XXXXX
Site : <site_name>
State : Failure
Planned Moves : 800
Completed Moves : 800
Failed Moves : 0
Start Time : 2025-04-24 22:37:18 UTC
End Time : 2025-05-31 09:34:42 UTC
Site Imbalance : 0.17611
Retry Rebalance : Yes
- The /var/local/log/bycast.log file on the Storage Node that is the ECJM leader shows the job completing successfully, followed by a timeout:
May 31 09:34:34 <node_name> ADE: |21316817 0203632337 ECJM VMPR 2025-05-31T09:34:34.780492| NOTICE 0891 db566129974939cb ECJM: XXXXX(rebalance 20): Got JOB_DONE for job XXXXX; with result SUCS
May 31 09:34:34 <node_name> ADE: |21316817 0203632337 ECJM _DON 2025-05-31T09:34:34.856580| NOTICE 0591 db566129974939cb ECJM:XXXXX(rebalance 20): Finished one round of VCS moves,terminated: true, _status: JOBSTATUS_IN_PROGRESS
May 31 09:34:42 <node_name> ADE: |21316817 0203632337 ECJM EVIU 2025-05-31T09:34:42.344063| WARNING 0402 db566129974939cb ECJM:XXXXX(rebalance 20): Rebalance job failed to achieve sufficient site balance within timeout. Stopping.
May 31 09:34:42 <node_name> ADE: |21316817 0203632337 ECJM ^RDY 2025-05-31T09:34:42.489279| WARNING 0062 db566129974939cb ECJM: Caught exception 'Stopping job due to timeout.' when running job 15197037317273935349: Site Rebalance - Group ID 20.
May 31 09:34:42 <node_name> ADE: |21316817 0203632337 ECJM ^RDY 2025-05-31T09:34:42.529522| ERROR 1125 db566129974939cb PROC: Exception: Dynamic exception type: std::runtime_error#012std::exception::what: Stopping job due to timeout.#012
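The two symptoms above can be checked mechanically. The following is a minimal sketch, not a StorageGRID tool: it assumes the status output and the bycast.log lines have been saved as plain text, and the helper names (parse_status, failed_despite_complete, has_timeout_stop) are hypothetical. The field names and the timeout message are taken verbatim from the output shown in this article.

```python
# Illustrative helpers only; none of these names exist in StorageGRID.

# Message text copied from the bycast.log excerpt in this article.
TIMEOUT_MARKER = "Rebalance job failed to achieve sufficient site balance within timeout"


def parse_status(text):
    """Parse 'Key : Value' lines from saved rebalance-data status output."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in values stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def failed_despite_complete(fields):
    """True when State is Failure but every planned move completed."""
    try:
        planned = int(fields["Planned Moves"])
        completed = int(fields["Completed Moves"])
        failed = int(fields["Failed Moves"])
    except (KeyError, ValueError):
        # Missing or non-numeric counters: cannot confirm the symptom.
        return False
    return fields.get("State") == "Failure" and completed == planned and failed == 0


def has_timeout_stop(log_lines):
    """True when any ECJM log line carries the site-balance timeout warning."""
    return any("ECJM" in line and TIMEOUT_MARKER in line for line in log_lines)
```

If failed_despite_complete returns True for the status output and has_timeout_stop returns True for the ECJM leader's log, the job matches the pattern described in this article.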