AWS or GCP CVO rebooted due to multiple disks missing

Category:: cloud-volumes-ontap-cvo

Specialty:: ds_cvo

Applies to

Cloud Volumes ONTAP (CVO)
Amazon Web Services (AWS)
Google Cloud Platform (GCP)

Issue

An AWS / GCP CVO node rebooted, and the surviving HA partner generated the AutoSupport message HA Group Notification (MULTIPLE DISKS MISSING) ERROR.
The surviving node's EMS logs show that it lost access to its mirrored Pool1 disks, which are attached to the failed node:

Mon Jun 03 16:23:02 +0000 [CVO-01: monitor: monitor.globalStatus.critical:EMERGENCY]: This node has taken over CVO-02. One or more mirrored aggregates are degraded.

Mon Jun 03 16:22:35 +0000 [CVO-01: dmgr_thread: raid.disk.missing:info]: Disk /aggr1/plex1/rg0/0d.10 S/N [00000000V9NeubcHXfRG] UID [00000000V9NeubcHXfRG] is missing from the system
Mon Jun 03 16:22:35 +0000 [CVO-01: config_thread: raid.config.filesystem.disk.missing:info]: File system Disk /aggr1/plex1/rg0/0d.10 S/N [00000000V9NeubcHXfRG] UID [00000000V9NeubcHXfRG] is missing.

Note: The above errors are seen for all disks owned by affected node CVO-02.
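
To confirm that every missing disk belongs to the same plex, the raid.disk.missing events can be grouped by aggregate and plex. The following is a minimal Python sketch, not a NetApp tool: the function name missing_disks_by_plex is hypothetical, and the regular expression assumes the EMS line format and the /aggregate/plex/raid-group/disk path layout shown in the sample above.

```python
import re
from collections import defaultdict

# Assumption: EMS lines follow the layout in the sample above, i.e.
# "[<node>: <thread>: raid.disk.missing:<severity>]: Disk <path> S/N [<serial>] ..."
DISK_MISSING = re.compile(
    r"\[(?P<node>[^:\]]+): [^:]+: raid\.disk\.missing:\w+\]: "
    r"Disk (?P<path>/\S+) S/N \[(?P<serial>[^\]]+)\]"
)


def missing_disks_by_plex(log_text):
    """Group missing disk names by (aggregate, plex) from EMS log text."""
    groups = defaultdict(set)
    for match in DISK_MISSING.finditer(log_text):
        # Assumed path layout: /<aggregate>/<plex>/<raid group>/<disk>
        parts = match.group("path").strip("/").split("/")
        if len(parts) >= 4:
            aggregate, plex, disk = parts[0], parts[1], parts[-1]
            groups[(aggregate, plex)].add(disk)
    # Sort disk names so the result is stable across runs.
    return {key: sorted(disks) for key, disks in groups.items()}
```

If all keys in the result refer to plexes on Pool1 disks, the output is consistent with the surviving node losing only the mirror copy hosted by its partner.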

The storage failover show output reports "Previous giveback failed in module: raid", as shown below:

::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
CVO-01         CVO-02         false    Previous giveback failed in module: raid
CVO-02         CVO-01         -        Waiting for giveback
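
When this check is scripted against captured CLI output, the two state descriptions above are the strings to look for. The sketch below is illustrative only: summarize_failover is a hypothetical helper, and it assumes the state text appears verbatim as in the sample output rather than parsing the table columns.

```python
import re

# Assumption: the state description text matches the sample output verbatim.
GIVEBACK_FAILED = re.compile(r"Previous giveback failed in module: (\w+)")


def summarize_failover(output):
    """Summarize captured 'storage failover show' text.

    Returns the modules that vetoed the previous giveback and whether
    any node is reported as waiting for giveback.
    """
    return {
        "failed_modules": GIVEBACK_FAILED.findall(output),
        "partner_waiting": "Waiting for giveback" in output,
    }
```

A result with "raid" in failed_modules and partner_waiting set to True corresponds to the state described in this article.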

EMS logs (the errors below may repeat until the RAID resync completes):

Sat Jul 19 04:15:20 +0000 [CVO-01: cf_main: gb.cfo.abort.raid.fm:error]: Aggregate local:aggr8 is being resynced; canceling giveback.
Sat Jul 19 04:15:20 +0000 [CVO-01: cf_main: cf.rsrc.givebackVeto:alert]: Failover monitor: raid: giveback canceled due to active state.
Sat Jul 19 04:15:20 +0000 [CVO-01: cf_main: cf.fsm.autoGivebackVetoed:error]: Failover monitor: Automatic giveback has been deferred due to long running operations
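
Because these events repeat until the resync finishes, it can help to reduce them to the list of aggregates that are still blocking giveback. The following Python sketch does that under the assumption that the gb.cfo.abort.raid.fm message text matches the sample above; resyncing_aggregates is a hypothetical name, not part of any NetApp tooling.

```python
import re

# Assumption: message text is "Aggregate <name> is being resynced; canceling giveback."
RESYNC_VETO = re.compile(
    r"gb\.cfo\.abort\.raid\.fm:\w+\]: Aggregate (\S+) is being resynced"
)


def resyncing_aggregates(log_text):
    """Return aggregates named in giveback-veto events, in first-seen order."""
    seen = []
    for name in RESYNC_VETO.findall(log_text):
        # Events repeat until resync completes, so drop duplicates.
        if name not in seen:
            seen.append(name)
    return seen
```

An empty list from recent log output suggests that no aggregate is vetoing giveback for this reason any longer.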

Shortly after this event, the following AutoSupport alerts may be generated as a residual symptom of missing disks:

HA Group Notification (SYNCMIRROR PLEX FAILED) ALERT

NODEOQ:HA Group Notification from CVO-02 (NODE(S) OUT OF CLUSTER QUORUM) EMERGENCY

After the node reboots, it is able to reestablish connectivity to the presented AWS / GCP disks and giveback completes successfully.