
Troubleshooting Workflow: MHost RDB apps out of quorum


Applies to

  • ONTAP 9

Issue

Commands and services might stop functioning or function in a limited capacity when an RDB application is Out-Of-Quorum (that is, 'Local unit offline').

This is typically a transitional state, caused by a network partition, the health of the remote nodes, or the health of the local node.

The RDB cluster configuration consists of a defined set of replication sites (nodes), all of which are known to one another. Cluster membership and configuration is stored within the replicated file /var/rdb/_sitelist. All RDB applications or rings (mgwd, vldb, vifmgr, bcomd, and so on) share the sitelist configuration.

_sitelist (cluster configuration data) is automatically replicated within the system. The contents include the following:

  • Version
  • Cluster UUID
  • List of sites

Each site has an ID, a hostname, a pair of cluster IP addresses, and a state (eligible/ineligible). The eligibility setting governs whether the site participates in quorum formation; this is an administrative choice. Additionally, one site might be designated as holding 'epsilon', an extra partial vote that allows quorum to form with only half the sites. 'Epsilon' is not the same as 'master'. In two-node HA mode, _sitelist contains the HA_CONFIG attribute, which means that completely different rules are in effect for quorum handling.
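To see which node currently holds epsilon, you can filter the cluster show output on the Epsilon field (a minimal illustration; whether any node holds epsilon depends on the cluster's configuration):

::*> cluster show -epsilon true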

A quorum is a connected majority of like RDB apps, with one instance selected as the master. The master is usually one of the first several instances in the _sitelist. Each replication ring operates completely independently of the other rings. It is normal for different rings to have different masters, though the masters are typically located on the same node.
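Because each ring elects its own master, a single ring can be inspected with the -unitname parameter of cluster ring show (vldb here is just one example unit):

::*> cluster ring show -unitname vldb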

A node that is Out of Quorum (OOQ) is not a participating member of a quorum. That is, either it has not yet participated in quorum formation (for example, it is just booting up), or it has lost contact with the master, either because it took itself OOQ or because the master pushed it OOQ.

In the offline state, databases cannot be written or updated by the master of a quorum. However, a local point-in-time read-only copy of each database is available. How useful the read-only copy is depends on the specific RDB app. For example, the vldb might continue to answer queries from the N-blade while offline. Consult the owners of the various apps for specifics.

RDB apps compete with the D-blade and N-blade for CPU and I/O cycles. The system is not a real-time system and has no Service Level Agreements (SLAs); SLAs are planned for a future release. Therefore, RDB apps occasionally go OOQ on heavily loaded systems. This condition is not a bug.

When a local or remote node is OOQ, CLI commands can fail with 'Local unit offline' in the error message (some commands automatically retry transparently when the offline condition is encountered). When this happens, retry the command before digging deeper, as the condition is usually transitory.

If any of these issues occur on the node hosting the master, all apps go offline momentarily until a new master is elected.

Advanced commands:
To investigate the state of the quorum for all rings, use the advanced level command cluster ring show.

::> set advanced

::*> cluster ring show
Node           UnitName Epoch    DB Epoch DB Trnxs Master         Online
-------------- -------- -------- -------- -------- -------------- ---------
csiptc-2240-09 mgmt     88       88       917522   csiptc-2240-09 master
csiptc-2240-09 vldb     90       90       3889     csiptc-2240-09 master
csiptc-2240-09 vifmgr   87       87       308046   csiptc-2240-09 master
csiptc-2240-09 bcomd    86       86       10       csiptc-2240-09 master
csiptc-2240-09 crs      87       87       107      csiptc-2240-09 master
csiptc-2240-10 mgmt     88       88       917522   csiptc-2240-09 secondary
csiptc-2240-10 vldb     90       90       3889     csiptc-2240-09 secondary
csiptc-2240-10 vifmgr   87       87       308046   csiptc-2240-09 secondary
csiptc-2240-10 bcomd    86       86       10       csiptc-2240-09 secondary
csiptc-2240-10 crs      87       87       107      csiptc-2240-09 secondary
10 entries were displayed.

The cluster show command displays only the quorum state of mgwd (use cluster ring show and rdb_dump for all rings).

::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
csiptc-2240-09       true    true          false
csiptc-2240-10       true    true          false
2 entries were displayed.

Systemshell commands:

  1. Enter diagnostic mode: set diag
  2. Enter the systemshell on the appropriate node (you might have to unlock the diag user account first; see the example below): systemshell -node <node-name>
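A possible sequence, assuming the diag account is locked or has no password set (node-01 is a placeholder node name):

::> set diag
::*> security login unlock -username diag
::*> security login password -username diag
::*> systemshell -node node-01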

To investigate the state of the quorum when mgwd is not running, run rdb_dump in the FreeBSD shell. From any cluster node, use the tool to extract current state information for any or all of the RDB applications. The typical technique is to cat /var/rdb/_sitelist, then use the rdb_dump tool to investigate by directing it at the IP addresses of interest (or at localhost); a worked example appears after the option listing below. rdb_dump is capable of showing:

  • Overall health
  • Transaction flow
  • Database versions
  • Various components and internals

Type rdb_dump -h for a list of options. Note that all rdb_dump output is from the point of view of the process that is being queried.

% rdb_dump -h

rdb_dump [<host>] [options] <unit>*
   -h       - help
   -c [n]   - continuous with n sec delay (default 3)
   -v       - verbose; all options other than 'c'
   -e       - environment vars
   -f       - configuration info
   -x       - internal developer info on selected components
   -u       - Local Unit
   -d       - individual database summary
   -q       - Quorum Mgr
   -r       - Recovery Mgr
   -t       - Transaction Mgr
   -z       - Call exportHealth API to query health at a node.
   [<host>] - Name or IP, default localhost.
   <unit>*  - select from: vldb, management, vifmgr, bcomd, t1, smfpilot (test units).
              if omitted, dumps all product units on the host.
   Options may be combined, e.g., '-qrtx'.



rdb_dump shows cluster configuration and health information from the perspective of an individual unit.
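For example, the typical technique described earlier might look like the following (a sketch, assuming systemshell access; 192.0.2.11 is a placeholder for a cluster IP address taken from _sitelist):

% cat /var/rdb/_sitelist
% rdb_dump -q
% rdb_dump 192.0.2.11 -qrt vldb

The first command shows the configured sites, the second dumps Quorum Mgr state for all product units on the local node, and the third dumps quorum, recovery, and transaction state for the vldb unit at a specific remote address.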

Notes:

  • 'Master' is a dynamic role, and 'epsilon' is a configuration setting. It often occurs that the master and epsilon sites differ.
  • The replication groups (vldb, vifmgr, bcomd, mgwd) operate independently. There may be different masters and health information for each. However, the configuration info should be shared.

Configuration
To analyze inter-box issues:

  1. Check that the environment and configuration (-e and -f) match as expected.
  2. Check that the various unit instances agree on configuration (see the sketch below).
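One way to compare configuration between two nodes (a sketch, assuming systemshell access; the host names and the management unit are examples only):

% rdb_dump node-01 -ef management > /tmp/node01.out
% rdb_dump node-02 -ef management > /tmp/node02.out
% diff /tmp/node01.out /tmp/node02.out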

Health (by default)
Given the correct configuration, health information will summarize the status of the replication group. 
Note: Health obtained from the master is always the most accurate; there is a slight delay in propagating secondary information to the other secondaries, but they eventually come into agreement.
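To query health directly at a given node, the -z option calls the exportHealth API (an illustration; the address is a placeholder):

% rdb_dump 192.0.2.10 -z vldb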

Monitoring
Use -c to continuously monitor a box under normal operation. Also, when rebooting a box, use -c to show the apps as they start and come online.
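For example, to refresh the Quorum Mgr state every 5 seconds (an illustration combining the -c and -q options listed above):

% rdb_dump -c 5 -q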

 

