StorageGRID DDS service crashes on various storage nodes after recent upgrade to 11.6
Applies to
- NetApp StorageGRID
- DDS Service (Distributed Data Store)
- 11.6.x software release (pre-11.6.0.5)
Issue
- StorageGRID Grid Manager reports an "Unable to communicate with node" alert for the DDS service on various storage nodes.
- DDS service core backtraces found in AutoSupport (core-backtrace.tgz) and locally on the node (/var/local/core/dds.xxx.txt) indicate that the service crash is related to EC_JobManager_Module.cc:
#9 0x00007f19eabb2708 in boost::archive::xml_iarchive_impl<boost::archive::xml_iarchive>::~xml_iarchive_impl() () from /lib/x86_64-linux-gnu/libboost_serialization.so.1.67.0
#10 0x0000000000a5227d in erasurecoding::RepairJob::PersistentState::load (this=0x7f1939ed48a0, str="<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\" ?>\n<!DOCTYPE boost_serialization>\n<boost_serialization signature=\"serialization::archive\" version=\"14\">\n<persistentState class_id=\"0\" tracking_lev"...) at /build/src/modules/ErasureCoding/EC_JobManager_Module/RepairJob.cc:948
#11 0x0000000000a109f5 in (anonymous namespace)::calculateRepairStatus (str=...) at /build/src/modules/ErasureCoding/EC_JobManager_Module/EC_JobManager_Module.cc:198
#12 0x0000000000a10772 in EC_JobManager::EC_JobManager_Module::getBytesFromJob (this=<optimized out>, persistentJob=...) at /build/src/modules/ErasureCoding/EC_JobManager_Module/EC_JobManager_Module.cc:1472
#13 0x0000000000a0609d in EC_JobManager::EC_JobManager_Module::handleUpdateMetrics (this=0x7f1939ed4d78) at /build/src/modules/ErasureCoding/EC_JobManager_Module/EC_JobManager_Module.cc:1543
#14 0x0000000000a021cd in EC_JobManager::EC_JobManager_Module::run (this=0x7f1939ed4d78) at /build/src/modules/ErasureCoding/EC_JobManager_Module/EC_JobManager_Module.cc:818
#15 0x0000000000a020cd in EC_JobManager_Module () at /build/src/modules/ErasureCoding/EC_JobManager_Module/EC_JobManager_Module.cc:738
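To compare a backtrace from another node against the one above, a small script can look for the same frames. The following is a minimal sketch, not a NetApp tool. It assumes the backtrace is plain text with gdb-style `#N` frame lines, as in the excerpt above. The names `SIGNATURES`, `matching_frames`, and `looks_like_this_issue` are hypothetical and were introduced here for illustration only.

```python
import re

# Frame signatures copied from the backtrace in this article.
# Treating these two strings as sufficient to identify the issue is an
# assumption of this sketch, not a documented NetApp criterion.
SIGNATURES = (
    "erasurecoding::RepairJob::PersistentState::load",
    "EC_JobManager_Module.cc",
)

# A gdb frame line starts with "#<number>" followed by the frame text.
FRAME_RE = re.compile(r"^#(\d+)\s+(.*)$")


def matching_frames(backtrace_text):
    """Return (frame_number, frame_text) for frames that mention a signature."""
    hits = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and any(sig in match.group(2) for sig in SIGNATURES):
            hits.append((int(match.group(1)), match.group(2)))
    return hits


def looks_like_this_issue(backtrace_text):
    """Return True only if every signature appears in at least one frame."""
    frames = " ".join(text for _, text in matching_frames(backtrace_text))
    return all(sig in frames for sig in SIGNATURES)
```

To use it, read the contents of the backtrace file (for example, the dds.xxx.txt file under /var/local/core/) into a string and pass it to `looks_like_this_issue`. A `True` result only means the same frames are present. It does not by itself confirm the root cause.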