MetroCluster IP Remote site multiple disks failed
Applies to
- ONTAP 9
- MetroCluster
Issue
- Flow control was disabled on the MetroCluster IP ports of the cluster switches (a verification sketch is included at the end of this section).
- Multiple Disk Failure Event: HA Group Notification from Cluster1-1a (FILESYSTEM DISK NOT RESPONDING) ERROR is reported.
- The following errors are seen in the cluster:
NV mirroring went offline just a few seconds before the cluster network degraded alert:
Mon Sep 11 15:03:37 +1000 [Cluster1-1a: nvmm_error: nvmm.mirror.offlined:debug]: params: {'mirror': 'HA_PARTNER'}
Mon Sep 11 15:03:37 +1000 [Cluster1-1a: nvmm_error: nvmm.mirror.offlined:debug]: params: {'mirror': 'DR_PARTNER'}
Mon Sep 11 15:03:45 +1000 [Cluster1-1a: vifmgr: vifmgr.port.monitor.failed:debug]: The "link_flapping" health check for port e0c (node Cluster1-1a) has failed. The port is operating in a degraded state.
Mon Sep 11 15:03:45 +1000 [Cluster1-1a: vifmgr: callhome.clus.net.degraded:debug]: Call home for CLUSTER NETWORK DEGRADED: Frequent Link Flapping - Cluster port e0c on node Cluster1-1a has experienced multiple link down notification
The NV mirror state changes back to online after some time:
Mon Sep 11 15:15:44 +1000 [Cluster1-1a: nvmm_mirror_sync: nvmm.mirror.state.change:debug]: mirror of sysid 2, partner_type DR PARTNER, changed state from NVMM_MIRROR_SYNCING_OTHER to NVMM_MIRROR_ONLINE and took 1684 msecs.
Mon Sep 11 15:17:09 +1000 [Cluster1-1a: nvmm_mirror_sync: nvmm.mirror.state.change:debug]: mirror of sysid 2, partner_type DR PARTNER, changed state from NVMM_MIRROR_SYNCING_OTHER to NVMM_MIRROR_ONLINE and took 1605 msecs.
Mon Sep 11 15:12:53 +1000 [Cluster1-1b: nvmm_mirror_sync: nvmm.mirror.state.change:debug]: mirror of sysid 2, partner_type DR PARTNER, changed state from NVMM_MIRROR_SYNCING_OTHER to NVMM_MIRROR_ONLINE and took 1540 msecs.
Mon Sep 11 15:12:55 +1000 [Cluster1-1b: nvmm_mirror_sync: nvmm.mirror.state.change:debug]: mirror of sysid 1, partner_type HA Partner, changed state from NVMM_MIRROR_SYNCING_OTHER to NVMM_MIRROR_ONLINE and took 1545 msecs
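After the links stabilize, the current NVRAM mirror state can be confirmed from the cluster shell. This is a minimal sketch using standard ONTAP commands, with cluster and node names taken from this example; adjust for your environment:
Cluster1::> metrocluster interconnect mirror show
Cluster1::> metrocluster check run
Cluster1::> metrocluster check show
Both the HA Partner and DR Partner mirrors for each node should report an online operational status, and the MetroCluster checks should report ok.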
- Some or all remote mirrored plexes are offline with drives marked as failed.
Plex /Cluster1-1a_ssd_aggr1/plex1 (offline, failed, inactive, pool1)
RAID group /Cluster1-1a_ssd_aggr1/plex1/rg0 (partial)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity FAILED N/A 3630753/ -
parity FAILED N/A 3630753/ -
data FAILED N/A 3630753/ -
data FAILED N/A 3630753/ -
data FAILED N/A 3630753/ -
data FAILED N/A 3630753/ -
data FAILED N/A 3630753/ -
data FAILED N/A 3630753/ -
data FAILED N/A 3630753/ -
data FAILED N/A 3630753/ -
data FAILED N/A 3630753/ -
Raid group is missing 11 disks.
Plex /Cluster1-1a_root/plex12 (offline, failed, inactive, pool1)
RAID group /Cluster1-1a_root/plex12/rg0 (partial)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity FAILED N/A 63849/ -
parity FAILED N/A 63849/ -
data FAILED N/A 63849/ -
data FAILED N/A 63849/ -
data FAILED N/A 63849/ -
Raid group is missing 5 disks.
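The state of the affected plexes and any drives still marked failed can be reviewed with standard commands; a minimal sketch, assuming the aggregate and node names shown above:
Cluster1::> storage aggregate plex show -aggregate Cluster1-1a_ssd_aggr1
Cluster1::> storage disk show -broken
Cluster1::> system node run -node Cluster1-1a -command "sysconfig -r"
Plex and RAID group listings like the one above typically come from the nodeshell sysconfig -r output.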
Site A: Cluster2
Nodes:
Cluster2-1a - no issues
Cluster2-1b - no issues
Site B: Cluster1
Nodes:
Cluster1-1a - all remote disks are failed/missing
Cluster1-1b - no issues
- There are no underlying hardware issues with the storage or the switches.
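To verify the flow-control configuration referenced in the first bullet, check the settings on both the cluster switches and the node ports and compare them against the reference configuration file (RCF) for the switch model. A minimal sketch, assuming Cisco Nexus cluster switches; commands and required values vary by switch vendor and RCF version:
On the cluster switch (Cisco NX-OS):
switch# show interface flowcontrol
On the ONTAP nodes:
Cluster1::> network port show -fields flowcontrol-admin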