Both admin and gateway nodes rebooted simultaneously during 11.8.0 upgrade and caused an outage
Applies to
NetApp StorageGRID 11.7.0.X to 11.8.0
Issue
- During a StorageGRID upgrade from 11.7.0.X to 11.8.0, the admin node and the gateway node rebooted at the same time to apply the upgrade, resulting in an outage.
- Admin node install.log:
[2024-03-20T23:05:30+00:00 INSG] Received SIGTERM; attempting to stop StorageGRID services
[2024-03-20T23:05:30+00:00 INSG]
[2024-03-20T23:05:30+00:00 INSG] *************************
[2024-03-20T23:05:30+00:00 INSG] *** Stopping Services ***
[2024-03-20T23:05:30+00:00 INSG] *************************
[2024-03-20T23:05:30+00:00 INSG]
[2024-03-20T23:05:30+00:00 INSG] [root@<admin_node>:/root] >>> rm -f /var/local/no-reboot-triggered
[2024-03-20T23:05:30+00:00 INSG] [root@<admin_node>:/root] >>> service keepalived stop
[2024-03-20T23:05:30+00:00 INSG] Stopping keepalived: keepalived.
[2024-03-20T23:05:30+00:00 INSG] [root@<admin_node>:/root] >>> ruby /usr/local/lib/site_ruby/bycast/servermanager/stopall.rb --verbose --timeout=720
[2024-03-20T23:05:30+00:00 INSG] remove all error states
[2024-03-20T23:05:31+00:00 INSG] Sent request to add storagegrid_administratively_down metric.
[2024-03-20T23:05:35+00:00 INSG] stopping all services
[2024-03-20T23:05:36+00:00 INSG] stopping acct-tunnel
[2024-03-20T23:05:36+00:00 INSG] stopping alertmanager
[2024-03-20T23:05:36+00:00 INSG] stopping ams
[2024-03-20T23:05:36+00:00 INSG] stopping attrDownPurge
[2024-03-20T23:05:36+00:00 INSG] stopping attrDownSamp1
- Gateway node install.log:
[2024-03-20T23:04:53+00:00 INSG] Received SIGTERM; attempting to stop StorageGRID services
[2024-03-20T23:04:53+00:00 INSG]
[2024-03-20T23:04:53+00:00 INSG] *************************
[2024-03-20T23:04:53+00:00 INSG] *** Stopping Services ***
[2024-03-20T23:04:53+00:00 INSG] *************************
[2024-03-20T23:04:53+00:00 INSG]
[2024-03-20T23:04:53+00:00 INSG] [root@<gateway_node>:/root] >>> rm -f /var/local/no-reboot-triggered
[2024-03-20T23:04:53+00:00 INSG] [root@<gateway_node>:/root] >>> service keepalived stop
[2024-03-20T23:04:53+00:00 INSG] Stopping keepalived: keepalived.
[2024-03-20T23:04:53+00:00 INSG] [root@<gateway_node>:/root] >>> ruby /usr/local/lib/site_ruby/bycast/servermanager/stopall.rb --verbose --timeout=720
[2024-03-20T23:04:54+00:00 INSG] remove all error states
[2024-03-20T23:04:54+00:00 INSG] Sent request to add storagegrid_administratively_down metric.
[2024-03-20T23:04:55+00:00 INSG] stopping all services
[2024-03-20T23:04:56+00:00 INSG] stopping acct-tunnel
[2024-03-20T23:04:56+00:00 INSG] stopping dynip
[2024-03-20T23:04:56+00:00 INSG] stopping jaeger-agent
[2024-03-20T23:04:56+00:00 INSG] stopping miscd
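The overlap between the two shutdowns can be confirmed by comparing the "Received SIGTERM" timestamps in each node's install.log. The following is a minimal sketch, not a NetApp tool: the function names and the 5-minute window are assumptions chosen for illustration, and the only log format it relies on is the `[<ISO timestamp> INSG] Received SIGTERM` line shown in the excerpts above.

```python
import re
from datetime import datetime, timedelta

# Matches the shutdown marker seen in the install.log excerpts, e.g.
# [2024-03-20T23:05:30+00:00 INSG] Received SIGTERM; attempting to stop ...
SIGTERM_RE = re.compile(r"^\[(?P<ts>\S+) INSG\] Received SIGTERM")


def sigterm_times(log_text):
    """Return the timestamp of every 'Received SIGTERM' line in a log."""
    times = []
    for line in log_text.splitlines():
        match = SIGTERM_RE.match(line.strip())
        if match:
            times.append(datetime.fromisoformat(match.group("ts")))
    return times


def overlapping_shutdowns(times_a, times_b, window=timedelta(minutes=5)):
    """Pair up shutdowns on two nodes that started within `window` of each other.

    The 5-minute default is an illustrative assumption, not a documented value.
    """
    return [(a, b) for a in times_a for b in times_b if abs(a - b) <= window]


if __name__ == "__main__":
    # First line of each excerpt from this article.
    admin_log = "[2024-03-20T23:05:30+00:00 INSG] Received SIGTERM; attempting to stop StorageGRID services"
    gateway_log = "[2024-03-20T23:04:53+00:00 INSG] Received SIGTERM; attempting to stop StorageGRID services"

    for admin_ts, gateway_ts in overlapping_shutdowns(
        sigterm_times(admin_log), sigterm_times(gateway_log)
    ):
        gap = abs(admin_ts - gateway_ts).total_seconds()
        print(f"admin {admin_ts} / gateway {gateway_ts}: {gap:.0f} s apart")
```

For the excerpts above, the two shutdowns began 37 seconds apart (23:04:53 on the gateway node, 23:05:30 on the admin node), so both nodes were stopping services at the same time.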