How to confirm the restore status of SGRID's hints file?
WARNING: This information comes from engineering, and it has not been confirmed whether it may be published for customers, so VISIBILITY is set to Internal.
Applies to
StorageGRID (SGRID)
Answer
- For SGRID OS 11.4 (EOVS)
- Look in /var/local/lib/cassandra/hints. If there are files there, that storage node is holding hints for another storage node. Example:
root@SG-S1-Disk-012:/var/local/lib/cassandra/hints # ls -thlr | grep hints | awk '{print $6 " " $7}'| sort | uniq -c
844 Nov 10
2 Nov 18
94 Nov 2
266 Nov 3
292 Nov 4
332 Nov 5
352 Nov 6
386 Nov 7
330 Nov 8
473 Nov 9
6 Oct 16
- Hint processing time is proportional to the number of updates and to how long the node was down, but a substantial backlog can be worked through in just a few minutes.
- For SGRID OS 11.5 or later
- Look in /var/local/rangedb/0/cassandra_hints/.
Example:
- If all nodes are normal or the restoration of hint files has been completed, no hint file exists.
root@DC1-S1:/var/local/rangedb/0/cassandra_hints # ls -l
total 0
- If one storage node goes down while PUT operations to SGRID continue, the hint file size on the other nodes gradually increases.
root@DC1-S1:/var/local/rangedb/0/cassandra_hints # ls -l
total 55680
-rw-r--r-- 1 cassandra cassandra 24712875 Nov 11 01:33 55ca61f3-7702-4a53-b2a1-b281927c941c-1668057738991-1.hints
root@DC1-S1:/var/local/rangedb/0/cassandra_hints # ls -l
total 55680
-rw-r--r-- 1 cassandra cassandra 26713262 Nov 11 01:35 55ca61f3-7702-4a53-b2a1-b281927c941c-1668057738991-1.hints
- SGRID OS 11.5 or later uses a 14-day hint window, as shown in the following output, so hints continue to be stored for outages longer than 3 hours, which was the limit in SGRID OS 11.4 or earlier (max_hint_window_in_ms: 10800000).
root@DC1-S1:~ # cat /etc/cassandra/cassandra.yaml | grep hint_window
max_hint_window_in_ms: 1209600000
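The millisecond values above are easier to compare once converted to hours and days. The following is a minimal sketch of that arithmetic, not a StorageGRID or Cassandra tool: the function name hint_window_summary is hypothetical, and it assumes the setting appears as an uncommented "max_hint_window_in_ms: <number>" line in the file it is given.

```shell
#!/usr/bin/env bash
# Sketch only: hint_window_summary is a hypothetical helper, not part of SGRID.
# It reads max_hint_window_in_ms from a cassandra.yaml-style file and prints
# the value in milliseconds, whole hours, and whole days.

hint_window_summary() {
    local yaml_file="$1"
    local ms hours days

    # Extract the digits of the first uncommented max_hint_window_in_ms line.
    ms=$(sed -n 's/^[[:space:]]*max_hint_window_in_ms:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$yaml_file" | head -n 1)

    if [ -z "$ms" ]; then
        echo "max_hint_window_in_ms not found in $yaml_file" >&2
        return 1
    fi

    hours=$(( ms / 3600000 ))    # 1 hour = 3,600,000 ms
    days=$(( ms / 86400000 ))    # 1 day  = 86,400,000 ms (integer division)
    echo "${ms} ms = ${hours} hours (${days} full days)"
}

# Example call (path taken from the output above):
#   hint_window_summary /etc/cassandra/cassandra.yaml
```

With the 11.5 value of 1209600000 this works out to 336 hours, that is 14 days; with the 11.4 value of 10800000 it works out to 3 hours.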
Additional Information
Q: When is the repair that uses the hint file performed, for example when the node is rebooted?
A: When one node sees an update that another node needs to know about - it immediately tries to send that update to the target node.
If the target node does not respond (it is known to be down, it is just particularly busy, etc.), then the update is stored on the sending node as a hint.
The sending node then tries periodically (probably ~once a minute) to send hints to every node it stores hints for.
If that hint is consumed, then it starts to send the rest of them in sequence.
Effectively - if a node is down for a couple of hours - every other node on the cluster has hints for it.
When it comes up, it starts to receive all those hints.
It’s been my experience that those backlogs clear within a couple of minutes of the node being up.
NOTE: Hinting is a data repair technique applied during write operations, and hints cover delete operations as well.
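Since an empty hints directory means that no hints are pending, the restore status can be checked by polling that directory until it is empty. The following is a rough sketch under stated assumptions: the function names hint_dir_status and wait_for_hints_drained are hypothetical, hint files are assumed to be regular files with "hints" in their name (matching the grep used earlier), and the 60-second default interval is only a guess loosely based on the "~once a minute" retry described above.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical and not shipped with SGRID.

# Print "<file count> <total bytes>" for hint files directly under a directory.
hint_dir_status() {
    local dir="$1"
    local count bytes
    count=$(find "$dir" -maxdepth 1 -type f -name '*hints*' | wc -l | tr -d ' ')
    bytes=$(find "$dir" -maxdepth 1 -type f -name '*hints*' -exec cat {} + | wc -c | tr -d ' ')
    echo "$count $bytes"
}

# Poll until no hint files remain.
# Arguments: directory, seconds between polls (default 60), max polls (default 30).
# Returns 0 once the directory holds no hint files, 1 if polls run out first.
wait_for_hints_drained() {
    local dir="$1"
    local interval="${2:-60}"
    local max_polls="${3:-30}"
    local i=1 status

    while [ "$i" -le "$max_polls" ]; do
        status=$(hint_dir_status "$dir")
        echo "poll $i: files and bytes pending = $status"
        # The first field is the file count; zero means nothing is pending.
        if [ "${status%% *}" -eq 0 ]; then
            return 0
        fi
        # Only sleep when another poll will follow.
        if [ "$i" -lt "$max_polls" ]; then
            sleep "$interval"
        fi
        i=$(( i + 1 ))
    done
    return 1
}

# Example calls (paths taken from this article):
#   wait_for_hints_drained /var/local/rangedb/0/cassandra_hints     # 11.5 or later
#   wait_for_hints_drained /var/local/lib/cassandra/hints           # 11.4
```

A byte count that keeps growing between polls suggests the target node is still down. A count that shrinks to zero suggests the hints have been delivered.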