StorageGRID storage node DECOM stuck due to site not having enough destination nodes for old EC profile
Applies to
StorageGRID Versions 11.5.0.8 and 11.6.0.7 and earlier.
Issue
The customer is unable to complete the decommission of a storage node following an EC profile change.
An EC decommission job error is reported on the EC leader (Node_Name). ECJM level 1 was enabled on the leader and a log bundle was captured. The bundle contains the messages below, including "Selecting destination for EC group failed after 5 retries." These messages suggest that the decommission is pausing because the old EC profile cannot find enough destination nodes in the storage pool: decommissioning node Node_Name would leave the pool with only 4 nodes.
Dec 9 19:29:01 Node_Name ADE: |21426716 1820442787 ECJM CSRT 2022-12-09T19:29:01.253077| NOTICE 0376 ECJM: EcgDecomJob: '11696086893380218698' ECG: 'DB1B050F-1755-4F86-995C-81085336DC19' VCS: 'DB349EB5-32DE-40C6-BB52-DA99AEF0A607': Selecting possible destination for affectedBytes: 0
...
Dec 9 19:29:01 Node_Name ADE: |21426716 1820442787 ECJM EPRP 2022-12-09T19:29:01.253925| ERROR 1054 PROC: Exception: /build/src/modules/ErasureCoding/EC_JobManager_Module/EcgDecommissionJob.cc(368): Throw in function void erasurecoding::EcgDecommissionJob::selectDestinationNode()#012Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >#012std::exception::what: ENFORCE failed: !"Selecting destination for EC group failed after 5 retries."#012
Dec 9 19:29:06 Node_Name ADE: |21426716 1820442641 ECJM CSRT 2022-12-09T19:29:06.397947| ERROR 0112 ECJM: Exception caught during decommissioning ENFORCE failed: 'SUCS' == *jobResult.
Dec 9 19:29:06 Node_Name ADE: |21426716 1820442641 ECJM CSRT 2022-12-09T19:29:06.398057| ERROR 1054 PROC: Exception: /build/src/modules/ErasureCoding/EC_JobManager_Module/NodeDecommissionJob.cc(447): Throw in function CXD_AtomContainer erasurecoding::NodeDecommissionJob::waitForJobCompletions()#012Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >#012std::exception::what: ENFORCE failed: 'SUCS' == *jobResult#012
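To check whether a captured log shows the same signature, the messages above can be searched for programmatically. The following is a minimal Python sketch, not a NetApp tool. It assumes the log lines follow the format shown above. The function name `find_destination_failures` is hypothetical. Pairing each failure with the most recent `EcgDecomJob` line is a heuristic that may mis-attribute failures if several jobs log concurrently.

```python
import re

# NOTICE line that names the decommission job, EC group, and VCS
# (format taken from the log excerpt above).
CONTEXT_RE = re.compile(
    r"EcgDecomJob: '(?P<job>\d+)' "
    r"ECG: '(?P<ecg>[0-9A-Fa-f-]+)' "
    r"VCS: '(?P<vcs>[0-9A-Fa-f-]+)'"
)

# ISO timestamp that appears just before the closing pipe of the ADE header.
TIMESTAMP_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)\|")

# Text of the failure reported by selectDestinationNode().
FAILURE_TEXT = "Selecting destination for EC group failed"


def find_destination_failures(lines):
    """Return one dict per destination-selection failure found in `lines`.

    Assumption: the most recent EcgDecomJob NOTICE line before a failure
    belongs to the same job. This is a heuristic, not a documented guarantee.
    """
    failures = []
    last_context = None
    for line in lines:
        context = CONTEXT_RE.search(line)
        if context:
            last_context = context.groupdict()
        if FAILURE_TEXT in line:
            stamp = TIMESTAMP_RE.search(line)
            record = {"timestamp": stamp.group(1) if stamp else None}
            # Fall back to None values if no job context was seen yet.
            record.update(last_context or {"job": None, "ecg": None, "vcs": None})
            failures.append(record)
    return failures
```

For example, passing the lines of an extracted log file to `find_destination_failures` (for instance via `open(path, errors="replace")`, where `path` is wherever the log bundle was unpacked) returns a list of the affected EC group IDs with timestamps. An empty list means the signature from this article was not found in that file.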