StorageGRID EC rebalance not progressing/stuck terminating
Applies to
StorageGRID 11.6.0.13
Issue
The EC rebalance job is stuck (no longer making any progress) after upgrading, placing nodes in maintenance mode during the rebalance, and updating E-Series controller firmware.
Attempting to cancel the EC rebalance job leaves it in a Terminating state:
===================================================================
Job ID : <job-ID>
Site : <sitename>
State : Terminating...
Total Moves : 1712
Completed Moves : 1555
Canceled Moves : 22
Failures (retryable) : 0
Failures (non-retryable) : 0
Percentage : 91
Start Time : 2023-12-07 05:26:01 UTC
Retry Rebalance : No
===================================================================
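As a rough cross-check of the status output above, the following Python sketch parses the "Key : Value" lines and counts the moves that are not yet accounted for. The names parse_status and outstanding_moves are illustrative and not part of any StorageGRID tooling. Treating Total Moves minus Completed Moves minus Canceled Moves as the remaining work is an assumption, not documented behavior.

```python
# Sample status text copied from the rebalance status output (placeholders kept as-is).
STATUS_TEXT = """\
Job ID : <job-ID>
Site : <sitename>
State : Terminating...
Total Moves : 1712
Completed Moves : 1555
Canceled Moves : 22
Percentage : 91
Start Time : 2023-12-07 05:26:01 UTC
Retry Rebalance : No
"""


def parse_status(text):
    """Split each 'Key : Value' line on the first ' : ' separator."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip separator rows and blank lines
            fields[key.strip()] = value.strip()
    return fields


def outstanding_moves(fields):
    """Assumption: remaining work = total - completed - canceled."""
    return (
        int(fields["Total Moves"])
        - int(fields["Completed Moves"])
        - int(fields["Canceled Moves"])
    )


if __name__ == "__main__":
    status = parse_status(STATUS_TEXT)
    print(status["State"], outstanding_moves(status))
```

For the output shown here, this leaves 135 moves outstanding (1712 - 1555 - 22), and 1555 completed moves out of 1712 is roughly 91 percent, matching the reported Percentage.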
The bycast logs show that the abort is attempted but never succeeds:
Feb 20 14:33:03 <node-name> ADE: |<node-id> 0484709853 ECJM #ABT 2024-02-20T14:33:03.027448| WARNING 0242 2e9d1becf0c43622 ECJM: Handling job abort for job 12085808753338013043
Feb 20 14:33:03 <node-name> ADE: |<node-id> 0484709853 ECJM #ABT 2024-02-20T14:33:03.036515| WARNING 0258 2e9d1becf0c43622 ECJM: Sending abort message to running job 12085808753338013043
Feb 20 14:33:03 <node-name> ADE: |<node-id> 0484709853 ECJM #ABT 2024-02-20T14:33:03.042362| INFO 0549 2e9d1becf0c43622 ECJM: Informing process 484739606, message: #ABT
Feb 20 14:33:03 <node-name> ADE: |<node-id> 0484739606 ECJM #DON 2024-02-20T14:33:03.043585| WARNING 0017 ECJM: Unexpected message from ECJM:{484709853@21228096}: [#ABT:[JBID(UI64):12085808753338013043]]
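To check whether a log excerpt shows the same pattern as the lines above, a minimal Python sketch such as the following can be used. It looks for a "Sending abort message to running job" entry followed by an "Unexpected message" entry that carries an #ABT for the same job ID. The regular expressions are derived only from the sample lines in this article, and the function name find_rejected_aborts is hypothetical. Other StorageGRID versions may format these messages differently.

```python
import re

# Patterns derived from the sample log lines in this article (an assumption,
# not an official log format specification).
ABORT_SENT = re.compile(r"ECJM: Sending abort message to running job (\d+)")
ABORT_UNEXPECTED = re.compile(
    r"ECJM: Unexpected message from ECJM:.*\[#ABT:\[JBID\(UI64\):(\d+)\]\]"
)


def find_rejected_aborts(lines):
    """Return job IDs whose abort was sent and then logged as unexpected."""
    sent = set()
    rejected = set()
    for line in lines:
        match = ABORT_SENT.search(line)
        if match:
            sent.add(match.group(1))
            continue
        match = ABORT_UNEXPECTED.search(line)
        if match and match.group(1) in sent:
            rejected.add(match.group(1))
    return sorted(rejected)


SAMPLE_LINES = [
    "Feb 20 14:33:03 <node-name> ADE: |<node-id> 0484709853 ECJM #ABT "
    "2024-02-20T14:33:03.036515| WARNING 0258 2e9d1becf0c43622 ECJM: "
    "Sending abort message to running job 12085808753338013043",
    "Feb 20 14:33:03 <node-name> ADE: |<node-id> 0484739606 ECJM #DON "
    "2024-02-20T14:33:03.043585| WARNING 0017 ECJM: Unexpected message from "
    "ECJM:{484709853@21228096}: [#ABT:[JBID(UI64):12085808753338013043]]",
]

if __name__ == "__main__":
    print(find_rejected_aborts(SAMPLE_LINES))
```

The sketch only reports job IDs for which both entries appear, in that order, so an "Unexpected message" line for an unrelated job is ignored.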