StorageGRID services state changed to unknown due to out of memory
Applies to
- StorageGRID
- DDS service (Distributed Data Store)
- LDR service (Local Distribution Router)
- SSM service (Server Status Monitor)
Issue
- The state of StorageGRID services such as DDS, LDR, and SSM on a Storage Node changes to Unknown and recovers after a few minutes.
- servermanager.log shows that the Cassandra service ended and was restarted:
2021-01-23 12:34:38 +0000 | cassandra | cassandra ended
2021-01-23 12:34:54 +0000 | cassandra | starting cassandra
- The base OS messages log shows that the Java process (the Cassandra service for StorageGRID) was killed for running out of memory and reaped by oom_reaper:
Jan 23 12:34:22 localhost kernel: [123456.123456] oom_reaper: reaped process 1234 (java), now anon-rss:10347420kB, file-rss:27560kB, shmem-rss:144kB
- The StorageGRID node reboots due to out-of-memory errors, which are recorded in daemon.log:
Mar 22 13:39:37 localhost wdogd[1691]: OOMM: successfully forked OOM canary process
Mar 22 13:39:38 localhost wdogd[1691]: OOMM: /usr/bin/storagegrid-oom-recover considering initializing swap file at Tue Mar 22 13:39:38 UTC 2022
Mar 22 13:39:41 localhost wdogd[1691]: OOMM: Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
Mar 22 13:39:41 localhost wdogd[1691]: OOMM: no label, UUID=
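To check whether a node shows the same symptoms, the log signatures above can be searched for programmatically. The following is a minimal sketch in Python, not a StorageGRID tool: the function names are hypothetical, and the regular expression is modeled only on the single oom_reaper sample line shown above, so it may need adjusting for other kernel versions or log formats.

```python
import re

# Assumption: the pattern mirrors the sample kernel line in this article.
OOM_REAPER = re.compile(
    r"oom_reaper: reaped process (?P<pid>\d+) \((?P<name>[^)]+)\), "
    r"now anon-rss:(?P<anon_kb>\d+)kB"
)


def find_oom_events(lines):
    """Return one dict per oom_reaper line found in the given log lines."""
    events = []
    for line in lines:
        match = OOM_REAPER.search(line)
        if match:
            events.append({
                "pid": int(match.group("pid")),
                "process": match.group("name"),
                # Convert kB to GiB for readability.
                "anon_rss_gib": round(int(match.group("anon_kb")) / 1024 ** 2, 2),
                "line": line.rstrip("\n"),
            })
    return events


def count_oomm_lines(lines):
    """Count watchdog (wdogd) OOMM messages such as those in daemon.log."""
    return sum(1 for line in lines if "wdogd[" in line and "OOMM:" in line)


if __name__ == "__main__":
    # Sample lines copied from this article; replace with lines read
    # from the node's own messages log and daemon.log.
    messages = [
        "Jan 23 12:34:22 localhost kernel: [123456.123456] oom_reaper: "
        "reaped process 1234 (java), now anon-rss:10347420kB, "
        "file-rss:27560kB, shmem-rss:144kB",
    ]
    daemon = [
        "Mar 22 13:39:37 localhost wdogd[1691]: OOMM: successfully forked "
        "OOM canary process",
    ]
    for event in find_oom_events(messages):
        print(event["process"], event["pid"], event["anon_rss_gib"], "GiB")
    print("OOMM watchdog lines:", count_oomm_lines(daemon))
```

A reaped process named java whose timestamp falls shortly before the "cassandra ended" entry in servermanager.log matches the pattern described in this article.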