NetApp Knowledge Base

How to confirm the restore status of SGRID's hints file?

Visibility:
Public
Category:
storagegrid-webscale
Specialty:
sgrid
WARNING

This information comes from engineering, and it is not yet clear whether it may be published for customers, so set VISIBILITY to Internal.

 

Applies to

StorageGRID (SGRID)

Answer

  • For SGRID OS 11.4 (EOVS)
    • Check /var/local/lib/cassandra/hints. If files are present, that storage node is holding hints for another storage node.

      Example:

      root@SG-S1-Disk-012:/var/local/lib/cassandra/hints # ls -thlr | grep hints | awk '{print $6 " " $7}'| sort | uniq -c
          844 Nov 10
            2 Nov 18
           94 Nov 2
          266 Nov 3
          292 Nov 4
          332 Nov 5
          352 Nov 6
          386 Nov 7
          330 Nov 8
          473 Nov 9
            6 Oct 16
    • Hint processing time is proportional to the number of updates and to how long a node was down. Even so, a substantial backlog can be processed in just a few minutes.
  • For SGRID OS 11.5 or later
    • Check /var/local/rangedb/0/cassandra_hints/.

Example:

  • If all nodes are healthy, or the restoration from hint files has completed, no hint files exist.
root@DC1-S1:/var/local/rangedb/0/cassandra_hints # ls -l
total 0
  • If one storage node goes down while PUT operations to SGRID continue, the hint file grows gradually.
root@DC1-S1:/var/local/rangedb/0/cassandra_hints # ls -l
total 55680
-rw-r--r-- 1 cassandra cassandra 24712875 Nov 11 01:33 55ca61f3-7702-4a53-b2a1-b281927c941c-1668057738991-1.hints
root@DC1-S1:/var/local/rangedb/0/cassandra_hints # ls -l
total 55680
-rw-r--r-- 1 cassandra cassandra 26713262 Nov 11 01:35 55ca61f3-7702-4a53-b2a1-b281927c941c-1668057738991-1.hints
  • SGRID OS 11.5 or later uses a 14-day hint window, as shown below. Outages longer than 3 hours (the limit in SGRID OS 11.4 or earlier, max_hint_window_in_ms: 10800000) therefore continue to be hinted.

root@DC1-S1:~ # cat /etc/cassandra/cassandra.yaml | grep hint_window
max_hint_window_in_ms: 1209600000
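
The checks above can be wrapped in small shell helpers. The following is a minimal sketch, not a StorageGRID tool: the function names hints_summary and window_hours are hypothetical, and it assumes hint files end in .hints, as in the 11.5 example listing.

```shell
# Hypothetical helpers for illustration; these names are not StorageGRID commands.

# Print "<file count> <total bytes>" for the *.hints files in a directory.
# Assumes hint files end in ".hints", as in the 11.5 listing above.
hints_summary() {
    local dir="$1" count=0 bytes=0 f size
    for f in "$dir"/*.hints; do
        # If the glob matched nothing, "$f" is the literal pattern; skip it.
        if [ -f "$f" ]; then
            size=$(wc -c < "$f")
            count=$((count + 1))
            bytes=$((bytes + size))
        fi
    done
    echo "$count $bytes"
}

# Convert a max_hint_window_in_ms value to whole hours.
window_hours() {
    echo $(( $1 / 3600000 ))
}
```

For example, hints_summary /var/local/rangedb/0/cassandra_hints prints "0 0" when no hint files are held. window_hours 1209600000 prints 336 (14 days), and window_hours 10800000 prints 3.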

Additional Information

Q: When is the repair using the hint file performed, for example when the node is rebooted?
A: When one node sees an update that another node needs to know about, it immediately tries to send that update to the target node.
If the target node does not respond (because it is known to be down, is particularly busy, etc.), the update is stored on the sending node as a hint.
The sending node then periodically (probably about once a minute) tries to send hints to every node it stores hints for.
If a hint is consumed, the sending node then sends the rest of them in sequence.
In effect, if a node is down for a couple of hours, every other node in the cluster has hints for it.
When it comes up, it starts to receive all those hints.
In my experience, those backlogs clear within a couple of minutes of the node coming up.
NOTE: Hinting is a data repair technique applied during write operations, and hints include delete operations as well.
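
To watch a backlog drain after a node comes back up, the hints directory can be polled until it is empty. This is a rough sketch only: the function name wait_for_hints_drained is hypothetical, the 60-second default interval simply mirrors the approximate retry period mentioned above, and hint files are assumed to end in .hints.

```shell
# Hypothetical helper for illustration; not a StorageGRID command.
# Usage: wait_for_hints_drained DIR [INTERVAL_SECONDS] [MAX_CHECKS]
# Returns 0 once DIR holds no *.hints files, 1 if MAX_CHECKS is exhausted.
wait_for_hints_drained() {
    local dir="$1" interval="${2:-60}" max_checks="${3:-30}" i=0 f remaining
    while [ "$i" -lt "$max_checks" ]; do
        remaining=0
        for f in "$dir"/*.hints; do
            # Count only real files; an unmatched glob stays a literal pattern.
            if [ -f "$f" ]; then
                remaining=$((remaining + 1))
            fi
        done
        if [ "$remaining" -eq 0 ]; then
            echo "no hint files left in $dir"
            return 0
        fi
        echo "$remaining hint file(s) still pending in $dir"
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

On SGRID OS 11.5 or later this could be called as wait_for_hints_drained /var/local/rangedb/0/cassandra_hints. On 11.4 the directory would be /var/local/lib/cassandra/hints.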
NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.