Skip to main content
NetApp Knowledge Base

S3/Swift requests return ServiceUnavailable failure along with node decommissioning

Views:
257
Visibility:
Public
Votes:
2
Category:
storagegrid-webscale
Specialty:
sgrid
Last Updated:

Applies to

StorageGRID OS 11.6

Issue

  • S3/Swift requests return ServiceUnavailable failure along with node decommissioning.
  • Alarms below also occur in the meantime:
    • SLSA    (CPU Load Average)
    • RORQ (Outbound Replications - Queued)
    • RIRQ    (Inbound Replications - Queued)
  • Bycast log indicates that the requests fail due to Cassandra TimeoutException :
    • HTTP Status Code=503, ErrorMsg=ServiceUnavailable, ErrorType=Client, CustomErrorMessage={<none>}, Details={<none>}
    • OBDI: checkForPreExistingObject Cassandra TimeoutException (Failed to execute cql at consistency TWO: SELECT event_time, event, last_access_time, object_lock_mode, object_lock_retain_until_time, object_lock_legal_hold, user_metadata, writetime(user_metadata), content_type, writetime(content_type), restore_start_time, restore_expiry_time, retier_time, object_partially_tiered FROM storagegrid.object_by_uuid WHERE uuid = 5595C096-928D-4CAF-B8D8-E03A4865304F - Cassandra Driver Error(Read timeout):'Operation timed out - received only 14 responses.' Detailed Info:[consistency: ALL, responses_received: 14, responses_required: 15, data_present: 1])
  • Prometheus data indicates
  1. CPU usage of a specific node that is under decommissioning is outstanding.
    sum by (instance) (sum by (instance, mode) (irate(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}[5m])) / count by (instance, mode)(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}))
    NOTEst is the common initial letter of all the storage nodes.
    S3/Swift requests return ServiceUnavailable failure
  2. iowait of this specific node gets increased 5 times (10% to 50%) along with decommissioning, which means the disk system is the bottleneck.
    sum by (mode)(irate(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'}[5m])) * 100 / count by (mode)(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'})
    ServiceUnavailable failure
  3. The usage of all disks of this specific node is almost 100%. 
    irate(node_disk_io_time_seconds_total{instance="issued storage node name",device=~'^sd.*'}[5m])*100
    S3/Swift requests return ServiceUnavailable failure along with node decommissioning
  • Comparing two decommissioned nodes on how the filesystem free bytes increased post-decommissioning, the steepness occurred on the bad node during the initial stage, which proves that the issued node has more read and truncate activity, during the initial stage of decommissioning. 
    • sum(node_filesystem_free_bytes{instance="node name",mountpoint=~"/var/local/rangedb/.*"})
      • 2023/7/5/13:16 GMT ~ 2023/7/5/14:36 GMT
        • Bad node:    724.45TB - 724.18TB = 0.27TB = 270GB
        • Good node: 528.47TB - 528.45TB = 0.02TB = 20GB
      • 2023/7/5/13:16 GMT ~ 2023/7/6/02:04 GMT
        • Bad node:    725.00TB - 724.18TB = 0.82TB = 820GB
        • Good node: 528.57TB - 528.45TB = 0.12TB = 120GB
          • node_filesystem_free_bytes.png
  • Comparing performance data between the issued node and another node in their daily ASUP, the bad node had higher read / write latency due to higher IOPS and throughput:
ASUP -> STATE-CAPTURE-DATA
Executing ionShow(99,0,0,0,0,0,0,0,0,0) on controller A:

Bad node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
   2 Hst :  51070465 3050503068160     23246   1869666 :  24067972 379745803264     45470  13645260 :    0
   3 Hst :  50889777 3049366095360     23310   1760814 :  24248943 380225977344     45183  13645220 :    0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
   0 Drv : 256171408 35181547092992     17239    852896 :  82234342 1336298067456      2512    286906 :    0
   4 Drv :       288     294912      4258      4241 :         0          0         0         0 :    0
 
Seconds since statistics cleared: 86411

Good node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
   2 Hst :  27647780 2876604737536      5274    829929 :  11826653 237424963584       131    511517 :    0
   3 Hst :  27509975 2877446842368      5303    826519 :  12073420 238340426240       131    620620 :    0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
   0 Drv : 136207478 28042508481024      3965    325577 :   7641267 528941565952      4254     45393 :    0
   4 Drv :       288     294912      4301      4219 :         0          0         0         0 :    0
 
Seconds since statistics cleared: 86411

NOTE:

  • R E A D S = S3/Swift's GET request
  • W R I T E S =  S3/Swift's PUT request
  • ByteXfered = throughput
  • Success  = IOPS 

Sign in to view the entire content of this KB article.

New to NetApp?

Learn more about our award-winning Support

NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.