EC-rebalance fails with Service unavailable. Error contacting EC Job Manager
Applies to
Issue
EC-rebalance reports: Service unavailable. Error contacting EC Job Manager. One or more nodes may appear in an Unknown state in the GUI without reporting any errors. The nodes eventually return to the Online state on their own.
bycast.log on the EC leader reports:
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ???? 2025-02-04T12:29:51.840953| NOTICE 0040 54375572ef20a272 ECJM: Starting job 1581323391038607979: Site Rebalance - Group ID 10.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.841797| NOTICE 0106 54375572ef20a272 ECJM: Job status of job 1581323391038607979 is JOBSTATUS_IN_PROGRESS
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.841824| NOTICE 0112 54375572ef20a272 ECJM: Resuming job 1581323391038607979
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.841833| NOTICE 0219 54375572ef20a272 ECJM: 1581323391038607979(rebalance 10): Resuming
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM %MDW 2025-02-04T12:29:51.850301| WARNING 1005 54375572ef20a272 ECJM: Volume Info request timed out.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM %MDW 2025-02-04T12:29:51.850336| NOTICE 0994 54375572ef20a272 ECJM: 1581323391038607979(rebalance 10): Cannot determine if there are Offline Volumes in the Grid.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM %MDW 2025-02-04T12:29:51.850347| NOTICE 1271 54375572ef20a272 ECJM: 1581323391038607979(rebalance 10): saving state. status: JOBSTATUS_PAUSED
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.852021| NOTICE 1057 54375572ef20a272 ECJM: 1581323391038607979(rebalance 10): Stopping child jobs.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.852132| WARNING 0062 54375572ef20a272 ECJM: Caught exception 'Failed to ensure all volumes are online. pausing job...' when running job 1581323391038607979: Site Rebalance - Group ID 10.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740691 ECJM _DON 2025-02-04T12:29:51.852204| NOTICE 0934 54375572ef20a272 ECJM: Received job completion message.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740691 ECJM _DON 2025-02-04T12:29:51.852230| NOTICE 0940 54375572ef20a272 ECJM: Job 1581323391038607979 completed with result GERR.
Feb 4 12:29:51 <nodename> ADE: |21835099 0110740693 ECJM ^RDY 2025-02-04T12:29:51.852232| ERROR 1081 54375572ef20a272 PROC: Exception: Dynamic exception type: std::runtime_error#012std::exception::what: Failed to ensure all volumes are online. pausing job...#012
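The signature in the excerpt above is a "Volume Info request timed out" warning followed by the rebalance job saving its state as JOBSTATUS_PAUSED. As a rough aid for spotting that pattern in a copy of bycast.log, the following Python sketch scans log lines for it. The function name find_paused_rebalance_jobs and the command-line usage are hypothetical and not part of any product tooling; the matched strings are taken only from the excerpt above and may differ between releases.

```python
import re
import sys

# Message text copied from the bycast.log excerpt in this article.
TIMEOUT_MSG = "Volume Info request timed out"
# Captures the job ID and the group ID from the "saving state" line.
PAUSED_RE = re.compile(
    r"ECJM: (\d+)\(rebalance (\d+)\): saving state\. status: JOBSTATUS_PAUSED"
)


def find_paused_rebalance_jobs(lines):
    """Return (job_id, group_id) pairs paused after a Volume Info timeout.

    Hypothetical helper: it only pairs a timeout warning with the next
    JOBSTATUS_PAUSED line from the ECJM module.
    """
    seen_timeout = False
    paused = []
    for line in lines:
        if " ECJM " not in line:
            continue  # ignore lines from other modules
        if TIMEOUT_MSG in line:
            seen_timeout = True
            continue
        match = PAUSED_RE.search(line)
        if match and seen_timeout:
            paused.append((match.group(1), int(match.group(2))))
            seen_timeout = False
    return paused


if __name__ == "__main__":
    # Usage (the log path is supplied by the caller): python scan.py <path-to-bycast.log>
    with open(sys.argv[1], errors="replace") as log_file:
        for job_id, group_id in find_paused_rebalance_jobs(log_file):
            print(f"job {job_id} (group {group_id}) paused after Volume Info timeout")
```

A match only indicates that the log contains the same sequence of messages as this article; it does not by itself confirm the cause.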
bycast-err.log may report:
ERROR Internal server error. The server encountered an error and could not complete your request. Try again. If the problem persists, contact support. EC job manager unavailable. (MgmtApi::LocalizedRuntimeError)
ERROR /usr/local/lib/site_ruby/mgmt-api/rest-client/resource.rb:833:in `handle_errors!'
ERROR /usr/local/lib/site_ruby/mgmt-api/rest-client/resource.rb:486:in `all'
ERROR /usr/local/lib/site_ruby/mgmt-api/data-recovery/data-recovery.rb:696:in `get_status'
ERROR Failed to retrieve erasure-coded repair status, EC job manager might be down.
ERROR Failed to retry erasure-coded repair due to EC job manager unavailable.
ERROR DataRecoveryManager failed to retry EC repair
ERROR Failed to retry repair: EC job manager unavailable. (MgmtApi::LocalizedRuntimeError)