Node quorum monitoring is controlled by the AvailabilityZone configuration (see the section called “AvailabilityZone”). Node quorum monitoring is enabled by setting the AvailabilityZone.percentageOfVotes or AvailabilityZone.minimumNumberOfVotes configuration properties. By default, node quorum monitoring is disabled.
When node quorum monitoring is enabled, the number of active nodes required for a quorum is determined using one of these methods: a percentage of the votes available in the cluster (percentageOfVotes) or a minimum number of votes (minimumNumberOfVotes).
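For example, a node deploy configuration that enables node quorum monitoring using a percentage of votes might look like the following minimal sketch. Only the percentageOfVotes and minimumNumberOfVotes property names come from this section; the surrounding property nesting and the zone name my-zone are assumptions and should be checked against the AvailabilityZone configuration reference.

    NodeDeploy = {
        availabilityZones = {
            "my-zone" = {
                // Enable node quorum monitoring using a percentage of the total votes.
                percentageOfVotes = 51
                // Alternatively, require an absolute number of votes instead:
                // minimumNumberOfVotes = 5
            }
        }
    }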
By default each node in a cluster has a single vote. However, this
can be changed using the AvailabilityZoneMembership
configuration (see the section called “AvailabilityZoneMembership”).
This allows certain nodes to be given more weight in the node quorum
calculation by assigning them a larger number of votes.
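As a sketch of what this might look like, the fragment below gives node A.X two votes in the same hypothetical zone. The votes property name and the nesting under nodes and availabilityZoneMemberships are assumptions; see the AvailabilityZoneMembership configuration reference for the exact syntax.

    NodeDeploy = {
        nodes = {
            "A.X" = {
                availabilityZoneMemberships = {
                    "my-zone" = {
                        // This node contributes two votes to the node quorum calculation.
                        votes = 2
                    }
                }
            }
        }
    }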
When node quorum monitoring is enabled, high availability services are Disabled if a node quorum is not met. This ensures that a partition and its associated data can never be active on multiple nodes. When a node quorum is restored, the node state is set to Partial or Active depending on the number of active remote nodes. Once a quorum has been reestablished, partitions must be migrated back to the node. The current quorum state is displayed on the High Availability tab for a node (see Figure 6.1, “Distribution status” for an example).
Figure 6.8, “Node quorum votes” shows a four-node cluster using percentage of votes to determine quorum. Each node shows its configured voting value, giving the cluster a total of nine votes when all nodes are active. The percentageOfVotes configuration value is set to 51%.
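Assuming the percentage is applied against the nine votes available when all nodes are active, a sub-cluster needs at least five votes to maintain quorum, since 0.51 × 9 = 4.59 and votes come in whole numbers. Any sub-cluster holding four or fewer votes therefore fails to meet quorum and has its high availability services Disabled.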
Figure 6.9, “Voting node quorum status - network failures” shows different scenarios based on network failures using the nodes in Figure 6.8, “Node quorum votes”. In this example all machines remain active; only network connectivity between the machines is lost. For each case, the disabled and active nodes, the total number of votes, and the vote percentage are shown for the sub-cluster caused by the network failure.
Figure 6.10, “Voting node quorum status - network and machine failures” shows different scenarios based on network and machine failures. For each case, the disabled and active nodes, the total number of votes, and the vote percentage are shown for the sub-cluster caused by the network and machine failure.
When a new node deploy configuration (see the section called “NodeDeploy”) that changes the node quorum configuration is activated, the changes are not immediate. All changes are immediately propagated to all nodes in the cluster, but they do not take effect until a node leaves and rejoins the cluster, or a remote node fails. This ensures that a misconfiguration does not cause nodes to be taken offline unexpectedly.
The configuration values for node quorum are summarized in the section called “AvailabilityZone”.
There are cases where an application can tolerate operating with partitions active on multiple nodes for periods of time. If this is acceptable behavior for an application, node quorum monitoring should be disabled. When node quorum monitoring is disabled, the administrator must manually restore the cluster when the connectivity problem has been resolved.
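Because node quorum monitoring is disabled by default, leaving both quorum properties unset in the availability zone definition keeps it off. A sketch, using the same assumed nesting and hypothetical zone name as above:

    NodeDeploy = {
        availabilityZones = {
            "my-zone" = {
                // Neither percentageOfVotes nor minimumNumberOfVotes is set,
                // so node quorum monitoring remains disabled for this zone.
            }
        }
    }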
The cluster partition summary display (see Figure 6.4, “Cluster partition summary”) can be used to determine if partitions are active on multiple nodes. Before attempting to restore the partitions active on multiple nodes, connectivity between all nodes must have been reestablished. See the section called “Node connectivity” for details on determining the status of node connectivity in a cluster.
Figure 6.11, “Restoring a partition” shows partition X being restored from node C.X to node A.X. The partition must currently be active on both the from and to node specified in the restore node dialog. When the availability zone is restored on node A.X, the partition objects will be merged with the objects on node C.X.
Clicking on the Enable button on the High Availability tab for a node (see Figure 6.12, “Enabling partitions”) causes all partitions being restored to this node to be merged with the partition data on the node specified as the from node in the restore partition dialog, and then the partition is made active on this node.
When these steps are complete, the cluster has been restored to service and all partitions now have a single active node.
Restoring a cluster after a multi-master scenario can also be performed using this command.
epadmin --servicename=A.X restore availabilityzone --from=C.X
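In this command, the service name A.X identifies the node being restored (the to node in the restore dialog) and --from=C.X identifies the from node; the partition objects on the two nodes are merged as described above. Substitute the node names used in your own cluster.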