AWS or GCP CVO rebooted due to multiple disks missing
Applies to
- Cloud Volumes ONTAP (CVO)
- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
Issue
- An AWS / GCP CVO node rebooted, and the surviving HA partner generated the following AutoSupport:
HA Group Notification (MULTIPLE DISKS MISSING) ERROR
- The surviving node's EMS logs show that it lost access to its mirrored Pool1 disks, which are attached to the failed node:
Mon Jun 03 16:23:02 +0000 [CVO-01: monitor: monitor.globalStatus.critical:EMERGENCY]: This node has taken over CVO-02. One or more mirrored aggregates are degraded.
Mon Jun 03 16:22:35 +0000 [CVO-01: dmgr_thread: raid.disk.missing:info]: Disk /aggr1/plex1/rg0/0d.10 S/N [00000000V9NeubcHXfRG] UID [00000000V9NeubcHXfRG] is missing from the system
Mon Jun 03 16:22:35 +0000 [CVO-01: config_thread: raid.config.filesystem.disk.missing:info]: File system Disk /aggr1/plex1/rg0/0d.10 S/N [00000000V9NeubcHXfRG] UID [00000000V9NeubcHXfRG] is missing.
Note: The above errors are seen for all disks owned by the affected node, CVO-02.
- The storage failover show output reports "Previous giveback failed in module: raid", as seen below:
::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
CVO-01         CVO-02         false    Previous giveback failed in module:
                                       raid
CVO-02         CVO-01         -        Waiting for giveback
- EMS logs (the errors below may repeat until the RAID resync completes):
Sat Jul 19 04:15:20 +0000 [CVO-01: cf_main: gb.cfo.abort.raid.fm:error]: Aggregate local:aggr8 is being resynced; canceling giveback.
Sat Jul 19 04:15:20 +0000 [CVO-01: cf_main: cf.rsrc.givebackVeto:alert]: Failover monitor: raid: giveback canceled due to active state.
Sat Jul 19 04:15:20 +0000 [CVO-01: cf_main: cf.fsm.autoGivebackVetoed:error]: Failover monitor: Automatic giveback has been deferred due to long running operations
- Shortly after this event, the following AutoSupport alerts may be generated as a residual symptom of missing disks:
HA Group Notification (SYNCMIRROR PLEX FAILED) ALERT
NODEOQ:HA Group Notification from CVO-02 (NODE(S) OUT OF CLUSTER QUORUM) EMERGENCY
- After the node reboots, it is able to reestablish connectivity to the presented AWS / GCP disks and giveback completes successfully.
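When reviewing a saved copy of the EMS log for this pattern, a small script can help tally the events named above and list the disks reported missing. The following is a minimal sketch, not a NetApp tool: the function name summarize_ems and both regular expressions are illustrative assumptions, and the parsing relies only on the line layout visible in the excerpts above (timestamp, then [node: process: event:severity]:, then the message).

```python
import re
from collections import Counter

# Assumed layout, taken from the EMS excerpts in this article:
#   <timestamp> [<node>: <process>: <event>:<severity>]: <message>
EMS_LINE = re.compile(
    r"^(?P<ts>.+?) \[(?P<node>[^:\]]+): (?P<proc>[^:\]]+): "
    r"(?P<event>[\w.]+):(?P<sev>\w+)\]: (?P<msg>.*)$"
)

# Disk path inside a raid.disk.missing message, e.g. /aggr1/plex1/rg0/0d.10
DISK_PATH = re.compile(r"Disk (/\S+)")


def summarize_ems(lines):
    """Count EMS events by name and collect disks reported missing.

    Hypothetical helper for log triage; lines that do not match the
    assumed layout are skipped rather than treated as errors.
    """
    events = Counter()
    missing = set()
    for line in lines:
        m = EMS_LINE.match(line.strip())
        if not m:
            continue
        events[m["event"]] += 1
        if m["event"] == "raid.disk.missing":
            d = DISK_PATH.search(m["msg"])
            if d:
                missing.add(d.group(1))
    return events, sorted(missing)


if __name__ == "__main__":
    # Sample lines copied from the excerpts in this article.
    sample = [
        "Mon Jun 03 16:22:35 +0000 [CVO-01: dmgr_thread: raid.disk.missing:info]: "
        "Disk /aggr1/plex1/rg0/0d.10 S/N [00000000V9NeubcHXfRG] "
        "UID [00000000V9NeubcHXfRG] is missing from the system",
        "Sat Jul 19 04:15:20 +0000 [CVO-01: cf_main: gb.cfo.abort.raid.fm:error]: "
        "Aggregate local:aggr8 is being resynced; canceling giveback.",
    ]
    events, missing = summarize_ems(sample)
    print(dict(events))
    print(missing)
```

A non-empty list of missing disks together with repeated gb.cfo.abort.raid.fm or cf.rsrc.givebackVeto counts would be consistent with the sequence described in this article, where giveback is deferred while the mirrored aggregate resyncs.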