SyncMirror plex failure reported after aggregate healing is performed
Applies to
- ONTAP 9
- MetroCluster FC
- FlexArray with NetApp E-Series backend
Issue
- Following a forced-on-disaster switchover, aggregate healing is run on the surviving site and reports success:
mcc.drsom.fsmStateTrans:debug]: params: {'from_state': 'heal_aggrs_in_progress', 'event': 'success'}
mcc.drsom.fsmStateEntry:debug]: params: {'state': 'heal_aggrs_complete'}
- It is followed immediately by SyncMirror Plex failures:
raid.assim.rg.missingChild:debug]: Aggregate stor168sp4, rgobj_verify: RAID object 0 has only 4 valid children, expected 5.
raid.assim.plex.missingChild:debug]: Aggregate stor168sp4, plexobj_verify: Plex 0 only has 0 working RAID groups (1 total) and is being taken offline
raid.assim.rg.missingChild:debug]: Aggregate stor168sp1, rgobj_verify: RAID object 0 has only 5 valid children, expected 6.
raid.assim.plex.missingChild:debug]: Aggregate stor168sp1, plexobj_verify: Plex 0 only has 0 working RAID groups (1 total) and is being taken offline
raid.assim.rg.missingChild:debug]: Aggregate stor168sp15, rgobj_verify: RAID object 0 has only 4 valid children, expected 5.
raid.assim.plex.missingChild:debug]: Aggregate stor168sp15, plexobj_verify: Plex 6 only has 0 working RAID groups (1 total) and is being taken offline
- Remote mirror plexes on switched over aggregates are missing LUNs:
Aggregate stor168sp4 (online, raid0, mirror degraded) (block checksums)
  Plex /stor168sp4/plex0 (offline, failed, inactive)
    RAID group /stor168sp4/plex0/rg0 (partial, block checksums)

      RAID Disk Device             HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)       Phys (MB/blks)
      --------- ------             ------------- ---- ---- ---- ----- --------------       --------------
      data      FAILED                                          N/A   13972000/ -
      data      lns24bb1:14.126L28 0f    -   -        0    LUN  N/A   13972000/28614656000 14000000/28672000000
      data      lns24ab1:14.126L29 0e    -   -        0    LUN  N/A   13972000/28614656000 14000000/28672000000
      data      lns24ab1:14.126L31 0e    -   -        0    LUN  N/A   13972000/28614656000 14000000/28672000000
      data      lns24bb1:14.126L30 0f    -   -        0    LUN  N/A   13972000/28614656000 14000000/28672000000
      Raid group is missing 1 disk.
- The missing LUNs appear in the broken pool of the switchover cluster and are incorrectly owned by the switched-over cluster:
Aggregate stor168sp4 (failed, raid0, partial) (block checksums)
  Plex /stor168sp4/plex0 (offline, failed, inactive)
    RAID group /stor168sp4/plex0/rg0 (partial, block checksums)

      RAID Disk Device             HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)       Phys (MB/blks)
      --------- ------             ------------- ---- ---- ---- ----- --------------       --------------
      data      lns24ab1:14.126L27 0e    -   -        0    LUN  N/A   13972000/28614656000 14000000/28672000000
      data      FAILED                                          N/A   13972000/ -
      data      FAILED                                          N/A   13972000/ -
      data      FAILED                                          N/A   13972000/ -
      data      FAILED                                          N/A   13972000/ -
      Raid group is missing 4 disks.
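
When many aggregates are affected, the rgobj_verify messages quoted in this article can be tallied with a short script. The following is a minimal Python sketch, not a NetApp tool: the function name summarize_missing_children and the regular expression are illustrative assumptions based only on the message text shown here, and the sample input is copied from the log excerpt in this article.

```python
import re

# Assumed pattern: matches only the rgobj_verify message text quoted in this
# article. Other ONTAP releases may word the message differently.
RG_MISSING = re.compile(
    r"Aggregate (?P<aggr>[^,\s]+), rgobj_verify: RAID object (?P<rg>\d+) "
    r"has only (?P<valid>\d+) valid children, expected (?P<expected>\d+)"
)


def summarize_missing_children(log_text):
    """Return {aggregate: number of missing children} from EMS log text.

    Hypothetical helper: counts are summed per aggregate across all
    rgobj_verify messages found in log_text.
    """
    summary = {}
    for match in RG_MISSING.finditer(log_text):
        missing = int(match.group("expected")) - int(match.group("valid"))
        aggr = match.group("aggr")
        summary[aggr] = summary.get(aggr, 0) + missing
    return summary


# Sample input taken from the messages in this article.
sample = (
    "raid.assim.rg.missingChild:debug]: Aggregate stor168sp4, rgobj_verify: "
    "RAID object 0 has only 4 valid children, expected 5.\n"
    "raid.assim.rg.missingChild:debug]: Aggregate stor168sp1, rgobj_verify: "
    "RAID object 0 has only 5 valid children, expected 6.\n"
    "raid.assim.rg.missingChild:debug]: Aggregate stor168sp15, rgobj_verify: "
    "RAID object 0 has only 4 valid children, expected 5.\n"
)

if __name__ == "__main__":
    for aggr, missing in summarize_missing_children(sample).items():
        print(f"{aggr}: {missing} missing")
```

Because the pattern is applied with finditer over the whole text, it also works on log excerpts where several messages have been joined onto one line.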