Switchless cluster hits bug 1253791 then suffers power loss resulting in cluster app quorum issues

Category: ontap-9

Specialty: core

Applies to

FAS2720
ONTAP 9
Two node switchless cluster

Issue

One node previously panicked due to quorum loss as a result of bug 1253791 (e0a/e0b cluster ports go link down)
Giveback is only partial because cluster applications cannot come online while the cluster ports are down; storage failover show reports:

Waiting for cluster applications to come online on the local node

A power loss while in this state power cycles both nodes
The node that had taken over and was cluster master comes up with cluster applications offline and reports the following error after boot:

Internal error: Cannot open corrupt replicated database. Automatic recovery
attempt has failed or is disabled. Check the event logs for details. This node
is not fully operational. Contact support personnel for the root volume recovery
procedures.

After attempting to clear the bootarg.rdb_corrupt state through recovery procedures, the node that had been taken over becomes master for mgwd while its other applications report "-", and the previous master becomes secondary for mgwd with its other applications offline
Example: Node cluster1-01 is the node that originally panicked due to quorum loss as a result of bug 1253791; node cluster1-02 had taken over and was master before the power loss and RDB recovery

Node 01 cluster ring show after rdb recovery:

::> set advanced

::*> cluster ring show

Node        UnitName Epoch    DB Epoch DB Trnxs Master      Online
----------- -------- -------- -------- -------- ----------- ---------
cluster1-01 mgmt     21       21       107      cluster1-01 master
cluster1-01 vldb     -        -        -        -           -
cluster1-01 vifmgr   -        -        -        -           -
cluster1-01 bcomd    -        -        -        -           -
cluster1-01 crs      -        -        -        -           -
cluster1-02 mgmt     21       21       107      cluster1-01 secondary
cluster1-02 vldb     0        18       3295     -           offline
cluster1-02 vifmgr   0        20       50       -           offline
cluster1-02 bcomd    0        19       6        -           offline
cluster1-02 crs      0        18       1        -           offline

Node 02 cluster ring show after rdb recovery:

Node        UnitName Epoch    DB Epoch DB Trnxs Master      Online
----------- -------- -------- -------- -------- ----------- ---------
cluster1-01 crs      -        -        -        -           -
cluster1-02 mgmt     21       21       109      cluster1-01 secondary
cluster1-02 vldb     0        18       3295     -           offline
cluster1-02 vifmgr   0        20       50       -           offline
cluster1-02 bcomd    0        19       6        -           offline
cluster1-02 crs      0        18       1        -           offline
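
When comparing ring output saved from both nodes, it can help to list every unit that is not online. The following is a minimal Python sketch that does this for captured cluster ring show text. It is not a NetApp tool: the names parse_ring_show and unhealthy_units are hypothetical, and it assumes that a healthy unit shows "master" or "secondary" in the Online column and that every data row has exactly seven whitespace-separated fields, as in the Node 01 output above.

```python
from typing import List, NamedTuple


class RingRow(NamedTuple):
    node: str
    unit: str
    epoch: str
    db_epoch: str
    db_trnxs: str
    master: str
    online: str


def parse_ring_show(text: str) -> List[RingRow]:
    """Parse captured 'cluster ring show' text into rows (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have seven fields; the header splits into nine and is skipped.
        if len(fields) != 7:
            continue
        # Skip the dashed separator line under the header.
        if set(fields[0]) == {"-"}:
            continue
        rows.append(RingRow(*fields))
    return rows


def unhealthy_units(rows: List[RingRow]) -> List[RingRow]:
    """Return rows whose Online column is neither 'master' nor 'secondary' (assumption)."""
    return [r for r in rows if r.online not in ("master", "secondary")]


# Sample copied from the Node 01 output in this article.
SAMPLE = """\
Node        UnitName Epoch    DB Epoch DB Trnxs Master      Online
----------- -------- -------- -------- -------- ----------- ---------
cluster1-01 mgmt     21       21       107      cluster1-01 master
cluster1-01 vldb     -        -        -        -           -
cluster1-01 vifmgr   -        -        -        -           -
cluster1-01 bcomd    -        -        -        -           -
cluster1-01 crs      -        -        -        -           -
cluster1-02 mgmt     21       21       107      cluster1-01 secondary
cluster1-02 vldb     0        18       3295     -           offline
cluster1-02 vifmgr   0        20       50       -           offline
cluster1-02 bcomd    0        19       6        -           offline
cluster1-02 crs      0        18       1        -           offline
"""

rows = parse_ring_show(SAMPLE)
for row in unhealthy_units(rows):
    print(f"{row.node} {row.unit}: online={row.online}")
```

Run against the Node 01 sample, the sketch lists vldb, vifmgr, bcomd and crs on both nodes and leaves out the two mgmt rows, which matches the symptom described in the Issue section.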