Node quorum management

Node quorum monitoring is controlled by the AvailabilityZone configuration (see the section called “AvailabilityZone”). Node quorum monitoring is enabled by setting the AvailabilityZone.percentageOfVotes or AvailabilityZone.minimumNumberOfVotes configuration properties. By default, node quorum monitoring is disabled.

When node quorum monitoring is enabled, the number of active nodes required for a quorum is determined using one of these methods: a percentage of the total votes in the cluster (AvailabilityZone.percentageOfVotes) or a minimum number of votes (AvailabilityZone.minimumNumberOfVotes).

By default, each node in a cluster has a single vote. However, this can be changed using the AvailabilityZoneMembership configuration (see the section called “AvailabilityZoneMembership”), which allows certain nodes to be given more weight in the node quorum calculation by assigning them a larger number of votes.
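
For example, the fragment below sketches how a percentage-based quorum and per-node votes might be expressed in a node deploy configuration. This is a minimal sketch: the configuration name, zone name, node names, and vote counts are hypothetical, and the exact property nesting should be verified against the AvailabilityZone and AvailabilityZoneMembership sections referenced above.

name = "quorum-example"
version = "1.0.0"
type = "com.tibco.ep.dtm.configuration.node"
configuration = {
    NodeDeploy = {
        availabilityZones = {
            "zone1" = {
                // Quorum is met when the active nodes hold at least 51% of the configured votes.
                percentageOfVotes = 51
            }
        }
        nodes = {
            "A.X" = {
                availabilityZoneMemberships = {
                    "zone1" = {
                        // A.X carries more weight in the quorum calculation than B.X.
                        votes = 2
                    }
                }
            }
            "B.X" = {
                availabilityZoneMemberships = {
                    "zone1" = {
                        votes = 1
                    }
                }
            }
        }
    }
}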

When node quorum monitoring is enabled, high availability services are Disabled if a node quorum is not met. This ensures that a partition and its associated data can never be active on multiple nodes. When a node quorum is restored, the node state is set to Partial or Active depending on the number of active remote nodes. Once a quorum has been reestablished, partitions must be migrated back to the node. The current quorum state is displayed on the High Availability tab for a node (see Figure 6.1, “Distribution status” for an example).

Figure 6.8, “Node quorum votes” shows a four-node cluster using percentage of votes to determine quorum. Each node shows its configured voting value, giving the cluster a total of nine votes when all nodes are active. The percentageOfVotes configuration value is set to 51%.

Figure 6.8. Node quorum votes


Figure 6.9, “Voting node quorum status - network failures” shows different scenarios based on network failures using the nodes in Figure 6.8, “Node quorum votes”. In this example, all machines remain active; only network connectivity between the machines is lost. For each case, the disabled and active nodes, the total number of votes, and the vote percentage are shown for the sub-cluster caused by the network failure.

Figure 6.9. Voting node quorum status - network failures
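
To make the percentage calculation concrete, suppose a network failure splits the nine-vote cluster into a sub-cluster holding five votes and a sub-cluster holding four votes (the actual split depends on which links fail and on the per-node vote values shown in the figure):

5 / 9 ≈ 55.6%, which meets the 51% threshold, so the nodes in that sub-cluster remain active.
4 / 9 ≈ 44.4%, which is below the threshold, so the nodes in that sub-cluster are Disabled.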


Figure 6.10, “Voting node quorum status - network and machine failures” shows different scenarios based on network and machine failures. For each case, the disabled and active nodes, the total number of votes, and the vote percentage are shown for the sub-cluster caused by the network and machine failures.

Figure 6.10. Voting node quorum status - network and machine failures
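
If an absolute vote count is preferred over a percentage, the minimumNumberOfVotes property can be set instead of percentageOfVotes. A minimal, hypothetical fragment (the zone name and value are illustrative only):

availabilityZones = {
    "zone1" = {
        // Quorum requires at least five of the nine configured votes.
        minimumNumberOfVotes = 5
    }
}

With nine votes in total, requiring at least five guarantees that no two disjoint sub-clusters can both hold a quorum at the same time.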


When a new node deploy configuration (see the section called “NodeDeploy”) that changes the node quorum configuration is activated, the changes are not immediate. All changes are immediately propagated to all nodes in the cluster, but they do not take effect until a node leaves and rejoins the cluster, or a remote node fails. This ensures that a misconfiguration does not cause nodes to be taken offline unexpectedly.
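
For example, an updated node deploy configuration might be loaded and activated with epadmin as sketched below. The load and activate configuration commands follow the usual epadmin pattern, but the exact parameter values, and the configuration name and version, are assumptions that should be checked against the epadmin reference. Even after activation succeeds, the new quorum values only take effect under the conditions described above.

epadmin --servicename=A.X load configuration --source=nodedeploy.conf
epadmin --servicename=A.X activate configuration --name=quorum-example --version=1.0.0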

The configuration values for node quorum are summarized in the section called “AvailabilityZone”.

Recovering partitions active on multiple nodes

There are cases where an application can tolerate operating with partitions active on multiple nodes for periods of time. If this is acceptable behavior for an application, node quorum monitoring should be disabled. When node quorum monitoring is disabled, the administrator must manually restore the cluster when the connectivity problem has been resolved.

The cluster partition summary display (see Figure 6.4, “Cluster partition summary”) can be used to determine if partitions are active on multiple nodes. Before attempting to restore the partitions active on multiple nodes, connectivity between all nodes must have been reestablished. See the section called “Node connectivity” for details on determining the status of node connectivity in a cluster.
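
The same information can typically be inspected from the command line; for example, assuming the epadmin display cluster and display partition commands and the node names used in this chapter:

epadmin --servicename=A.X display cluster
epadmin --servicename=A.X display partition

The cluster display should show the state of the connections to remote nodes, and the partition display should show the node on which each partition is currently active.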

Figure 6.11. Restoring a partition


Figure 6.11, “Restoring a partition” shows partition X being restored from node C.X to node A.X. The partition must currently be active on both the from and to node specified in the restore node dialog. When the availability zone is restored on node A.X, the partition objects will be merged with the objects on node C.X.

Figure 6.12. Enabling partitions


Clicking the Enable button on the High Availability tab for a node (see Figure 6.12, “Enabling partitions”) causes all partitions being restored to this node to be merged with the partition data on the node specified as the from node in the restore partition dialog; the partitions are then made active on this node.

When these steps are complete, the cluster has been restored to service and all partitions now have a single active node.

Restoring a cluster after a multi-master scenario can also be performed using this command:

epadmin --servicename=A.X restore availabilityzone --from=C.X