StorageGRID decommission job gets stuck against ADC node
Applies to
- NetApp StorageGRID
- 11.6.x software release (pre-11.6.0.9)
- Nodes with ADC (Administrative Domain Controller) service running
Issue
- When decommissioning ADC nodes, the job gets stuck and cannot be resumed
- The log file /var/local/log/decommission.d-rsm.log shows a 500 Internal Server Error:
DELETE to RSM service returned curl: (22) The requested URL returned error: 500 Internal Server Error500. Retrying in 10s
- Running the following command on each node in the same site returns different sets of entries:
curl http://localhost:8003/v1/raft/status
Example:
node1 displays 3 entries, but node2 shows 4.
root@node1:~ $ curl http://localhost:8003/v1/raft/status
...
"clusterStatus": {
"11286219": {
"IPAddress": "x.x.x.x:18003",
"isLeader": false,
"raftId": "ac36cb"
},
"11702624": {
"IPAddress": "x.x.x.x:18003",
"isLeader": true,
"raftId": "b29160"
},
"11791281": {
"IPAddress": "x.x.x.x:18003",
"isLeader": false,
"raftId": "b3ebb1"
}
},
...
root@node2:~ # curl http://localhost:8003/v1/raft/status
...
"clusterStatus": {
"11286219": {
"IPAddress": "x.x.x.x:18003",
"isLeader": false,
"raftId": "ac36cb"
},
"11702624": {
"IPAddress": "x.x.x.x:18003",
"isLeader": true,
"raftId": "b29160"
},
"11791281": {
"IPAddress": "x.x.x.x:18003",
"isLeader": false,
"raftId": "b3ebb1"
},
"11793293": {
"IPAddress": "x.x.x.x:18003",
"isLeader": false,
"raftId": "b3f38d"
}
},
...
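With more than a few nodes, comparing the outputs by eye is error-prone. The following is a minimal Python sketch of one way to spot the mismatch, assuming the output of the curl command above has been saved from each node and parsed as JSON. The function name membership_diff and the node labels are hypothetical, and the sketch assumes "clusterStatus" is a top-level key of the response, which the abridged output above does not confirm.

```python
import json


def membership_diff(status_by_node):
    """Return, for each node, the cluster member IDs it is missing.

    status_by_node maps a node label to the parsed JSON returned by
    /v1/raft/status on that node. Assumption: "clusterStatus" is a
    top-level key whose keys are the member IDs.
    """
    members = {
        node: set(status.get("clusterStatus", {}))
        for node, status in status_by_node.items()
    }
    # Union of every member ID reported by any node.
    all_ids = set().union(*members.values()) if members else set()
    # Keep only the nodes whose view lacks at least one ID.
    return {
        node: sorted(all_ids - ids)
        for node, ids in members.items()
        if all_ids - ids
    }


# Member IDs taken from the example output above (entry details omitted).
node1 = json.loads(
    '{"clusterStatus": {"11286219": {}, "11702624": {}, "11791281": {}}}'
)
node2 = json.loads(
    '{"clusterStatus": {"11286219": {}, "11702624": {}, '
    '"11791281": {}, "11793293": {}}}'
)

diff = membership_diff({"node1": node1, "node2": node2})
print(diff)  # {'node1': ['11793293']}
```

An empty result means every node reports the same set of members. Any node listed in the result is missing the entries shown for it, as node1 is missing 11793293 in the example.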