EC-rebalance fails with Service unavailable. Error contacting EC Job Manager
Applies to
Issue
An attempt to run an EC-rebalance fails with the error "Service unavailable. Error contacting EC Job Manager."
One or more nodes may appear in an Unknown state in the GUI without reporting any errors. These nodes eventually return to the online state on their own.
Bycast.log from the EC leader reports:
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ???? 2025-02-04T12:29:51.840953| NOTICE 0040 54375572ef20a272 ECJM: Starting job 1581323391038607979: Site Rebalance - Group ID 10.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.841797| NOTICE 0106 54375572ef20a272 ECJM: Job status of job 1581323391038607979 is JOBSTATUS_IN_PROGRESS
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.841824| NOTICE 0112 54375572ef20a272 ECJM: Resuming job 1581323391038607979
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.841833| NOTICE 0219 54375572ef20a272 ECJM: 1581323391038607979(rebalance 10): Resuming
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM %MDW 2025-02-04T12:29:51.850301| WARNING 1005 54375572ef20a272 ECJM: Volume Info request timed out.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM %MDW 2025-02-04T12:29:51.850336| NOTICE 0994 54375572ef20a272 ECJM: 1581323391038607979(rebalance 10): Cannot determine if there are Offline Volumes in the Grid.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM %MDW 2025-02-04T12:29:51.850347| NOTICE 1271 54375572ef20a272 ECJM: 1581323391038607979(rebalance 10): saving state. status: JOBSTATUS_PAUSED
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.852021| NOTICE 1057 54375572ef20a272 ECJM: 1581323391038607979(rebalance 10): Stopping child jobs.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.852132| WARNING 0062 54375572ef20a272 ECJM: Caught exception 'Failed to ensure all volumes are online. pausing job...' when running job 1581323391038607979: Site Rebalance - Group ID 10.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740691 ECJM _DON 2025-02-04T12:29:51.852204| NOTICE 0934 54375572ef20a272 ECJM: Received job completion message.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740691 ECJM _DON 2025-02-04T12:29:51.852230| NOTICE 0940 54375572ef20a272 ECJM: Job 1581323391038607979 completed with result GERR.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.852232| ERROR 1081 54375572ef20a272 PROC: Exception: Dynamic exception type: std::runtime_error#012std::exception::what: Failed to ensure all volumes are online. pausing job...#012
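To check quickly whether a collected log shows the same signature, the excerpt above can be scanned with a short script. The following is a minimal sketch, not a supported tool. The field layout in the regular expression is inferred only from the sample lines in this article, and the function name and signature list are hypothetical choices made for illustration.

```python
import re

# Layout inferred from the excerpt above (an assumption, not an official format spec):
# ... |<ids> <module> <state> <ISO timestamp>| <SEVERITY> <code> <id> <MODULE>: <message>
LINE_RE = re.compile(
    r"\|[^|]*?(?P<timestamp>\d{4}-\d{2}-\d{2}T[\d:.]+)\|\s+"
    r"(?P<severity>[A-Z]+)\s+\d+\s+\S+\s+"
    r"(?P<module>[A-Z]+):\s+(?P<message>.*)$"
)

# Message fragments taken verbatim from the log excerpt in this article.
SIGNATURES = (
    "Volume Info request timed out",
    "Failed to ensure all volumes are online",
)


def find_signature_lines(lines):
    """Return (timestamp, severity, message) for each line matching a signature."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and any(sig in match.group("message") for sig in SIGNATURES):
            hits.append(
                (
                    match.group("timestamp"),
                    match.group("severity"),
                    match.group("message"),
                )
            )
    return hits
```

To use it, pass the lines of a collected log file, for example `find_signature_lines(open(path, errors="replace"))`, where `path` points to wherever the log from the EC leader was saved. A non-empty result suggests the rebalance job paused because the volume information request timed out, as in the excerpt.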