# Cluster Specification

@@@ note

This document describes the design concepts of the clustering.

@@@

## Intro

Akka Cluster provides a fault-tolerant decentralized peer-to-peer based cluster
[membership](#membership) service with no single point of failure or single point of bottleneck.
It does this using [gossip](#gossip) protocols and an automatic [failure detector](#failure-detector).

## Terms

**node**
: A logical member of a cluster. There could be multiple nodes on a physical
machine. Defined by a *hostname:port:uid* tuple.

**cluster**
: A set of nodes joined together through the [membership](#membership) service.

**leader**
: A single node in the cluster that acts as the leader, managing cluster convergence
and membership state transitions.

## Membership

A cluster is made up of a set of member nodes. The identifier for each node is a
`hostname:port:uid` tuple. An Akka application can be distributed over a cluster with
each node hosting some part of the application. Cluster membership and the actors running
on that node of the application are decoupled. A node could be a member of a
cluster without hosting any actors. Joining a cluster is initiated
by issuing a `Join` command to one of the nodes in the cluster.
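
For example, joining can be done programmatically through the `Cluster` extension. A minimal sketch, where the system name, host, and port are placeholders rather than values prescribed by this specification:

```scala
import akka.actor.{ ActorSystem, Address }
import akka.cluster.Cluster

val system = ActorSystem("ClusterSystem")
// Address of any node that is already a member of the cluster
// (host and port are placeholders).
val contactPoint = Address("akka.tcp", "ClusterSystem", "host1", 2552)
// Issue the Join command through the Cluster extension.
Cluster(system).join(contactPoint)
```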

The node identifier internally also contains a UID that uniquely identifies this
actor system instance at that `hostname:port`. Akka uses the UID to be able to
reliably trigger remote death watch. This means that the same actor system can never
join a cluster again once it's been removed from that cluster. To re-join an actor
system with the same `hostname:port` to a cluster you have to stop the actor system
and start a new one with the same `hostname:port`, which will then receive a different
UID.

The cluster membership state is a specialized [CRDT](http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf), which means that it has a monotonic
merge function. When concurrent changes occur on different nodes the updates can always be
merged and converge to the same end result.
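
As a toy illustration of what a monotonic merge means here (a deliberate simplification, not Akka's actual membership representation): if member status only ever moves forward in a fixed order, two concurrent views can be merged by taking the furthest status per node, and every node arrives at the same result regardless of merge order:

```scala
// Status values ordered by their position in the lifecycle.
object Status extends Enumeration {
  val Joining, Up, Leaving, Exiting, Down, Removed = Value
}

// Monotonic merge: per node, keep the status that is furthest along.
// merge(a, b) == merge(b, a), and merging is associative and idempotent.
def merge(a: Map[String, Status.Value], b: Map[String, Status.Value]): Map[String, Status.Value] =
  (a.keySet ++ b.keySet).map { node =>
    node -> (a.get(node).toList ++ b.get(node).toList).max
  }.toMap
```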

### Gossip

The cluster membership used in Akka is based on Amazon's [Dynamo](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) system and
particularly the approach taken in Basho's [Riak](http://basho.com/technology/architecture/) distributed database.
Cluster membership is communicated using a [Gossip Protocol](http://en.wikipedia.org/wiki/Gossip_protocol), where the current
state of the cluster is gossiped randomly through the cluster, with preference to
members that have not seen the latest version.

#### Vector Clocks

[Vector clocks](http://en.wikipedia.org/wiki/Vector_clock) are a type of data structure and algorithm for generating a partial
ordering of events in a distributed system and detecting causality violations.

We use vector clocks to reconcile and merge differences in cluster state
during gossiping. A vector clock is a set of (node, counter) pairs. Each update
to the cluster state has an accompanying update to the vector clock.
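
A minimal vector clock sketch (illustrative only, not Akka's internal `VectorClock` class), showing the two operations the text describes: incrementing a node's counter on update, and the partial order used to compare two versions:

```scala
final case class VectorClock(versions: Map[String, Long] = Map.empty) {
  // Record an update made by `node`.
  def increment(node: String): VectorClock =
    copy(versions.updated(node, versions.getOrElse(node, 0L) + 1L))

  // Partial order: this clock happened before (or equals) `other` if
  // every counter is less than or equal to the corresponding counter.
  def <=(other: VectorClock): Boolean =
    versions.forall { case (n, c) => c <= other.versions.getOrElse(n, 0L) }

  // Two clocks are concurrent (conflicting) when neither dominates.
  def isConcurrentWith(other: VectorClock): Boolean =
    !(this <= other) && !(other <= this)
}
```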

#### Gossip Convergence

Information about the cluster converges locally at a node at certain points in time.
This is when a node can prove that the cluster state it is observing has been observed
by all other nodes in the cluster. Convergence is implemented by passing a set of nodes
that have seen the current state version during gossip. This information is referred to as the
seen set in the gossip overview. When all nodes are included in the seen set there is
convergence.
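
In other words, convergence is a purely local check over the gossiped seen set; a sketch of the idea (an assumed simplification of Akka's internal logic):

```scala
// A gossip snapshot carries the member set and the set of nodes that
// have acknowledged (seen) the current state version.
final case class GossipOverview(members: Set[String], seen: Set[String]) {
  // Convergence: every member is contained in the seen set.
  def convergence: Boolean = members.forall(seen.contains)
}
```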

Gossip convergence cannot occur while any nodes are `unreachable`. The nodes need
to become `reachable` again, or be moved to the `down` and `removed` states
(see the [Membership Lifecycle](#membership-lifecycle) section below). This only blocks the leader
from performing its cluster membership management and does not influence the application
running on top of the cluster. For example this means that during a network partition
it is not possible to add more nodes to the cluster. The nodes can join, but they
will not be moved to the `up` state until the partition has healed or the unreachable
nodes have been downed.

#### Failure Detector

The failure detector is responsible for trying to detect if a node is
`unreachable` from the rest of the cluster. For this we are using an
implementation of [The Phi Accrual Failure Detector](http://www.jaist.ac.jp/~defago/files/pdf/IS_RR_2004_010.pdf) by Hayashibara et al.

An accrual failure detector decouples monitoring and interpretation. That makes
it applicable to a wider range of scenarios and better suited to building generic
failure detection services. The idea is that it keeps a history of failure
statistics, calculated from heartbeats received from other nodes, and makes
an educated guess by taking multiple factors, and how they
accumulate over time, into account in order to better judge whether a
specific node is up or down. Rather than just answering "yes" or "no" to the
question "is the node down?" it returns a `phi` value representing the
likelihood that the node is down.
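
The shape of that calculation can be sketched as follows (an illustrative approximation; Akka's actual implementation lives in the remoting module and differs in detail). `phi` grows with the time elapsed since the last heartbeat, judged against the history of inter-arrival times:

```scala
// phi = -log10(1 - F(timeSinceLastHeartbeat)) where F is the cumulative
// distribution function of a normal distribution fitted to the history
// of heartbeat inter-arrival times.
def phi(timeSinceLastHeartbeatMillis: Double, meanMillis: Double, stdDevMillis: Double): Double = {
  val y = (timeSinceLastHeartbeatMillis - meanMillis) / stdDevMillis
  // Logistic approximation of the normal CDF.
  val cdf = 1.0 / (1.0 + math.exp(-y * (1.5976 + 0.070566 * y * y)))
  -math.log10(1.0 - cdf)
}

// A monitoring node suspects the monitored node when phi exceeds the
// configured threshold.
def isAvailable(phiValue: Double, threshold: Double = 8.0): Boolean =
  phiValue < threshold
```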

The `threshold` that is the basis for the calculation is configurable by the
user. A low `threshold` is prone to generating many wrong suspicions but ensures
quick detection in the event of a real crash. Conversely, a high `threshold`
generates fewer mistakes but needs more time to detect actual crashes. The
default `threshold` is 8 and is appropriate for most situations. However in
cloud environments, such as Amazon EC2, the value could be increased to 12 in
order to account for network issues that sometimes occur on such platforms.
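
The threshold is set through configuration; a minimal sketch of raising it for a cloud deployment:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Raise the failure detector threshold, e.g. for a cloud environment.
val config = ConfigFactory.parseString(
  "akka.cluster.failure-detector.threshold = 12.0"
).withFallback(ConfigFactory.load())

val system = ActorSystem("ClusterSystem", config)
```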

In a cluster each node is monitored by a few (default maximum 5) other nodes, and when
any of these detects the node as `unreachable` that information will spread to
the rest of the cluster through the gossip. In other words, only one node needs to
mark a node `unreachable` to have the rest of the cluster mark that node `unreachable`.

The nodes to monitor are picked out of neighbors in a hashed ordered node ring.
This is to increase the likelihood of monitoring across racks and data centers, but the order
is the same on all nodes, which ensures full coverage.
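
A sketch of that selection (an assumed simplification of Akka's heartbeat-sender logic): order all nodes by a hash of their identifier, then monitor the next few neighbors clockwise; the hash decorrelates ring position from rack or data center naming while keeping the order identical on every node:

```scala
def nodesToMonitor(self: String, allNodes: Set[String], maxMonitored: Int = 5): Seq[String] = {
  // Hashed order: uncorrelated with physical placement, same on all nodes.
  val ring = allNodes.toVector.sortBy(n => (n.hashCode, n))
  val i = ring.indexOf(self) // assumes `self` is a member of `allNodes`
  // The next `maxMonitored` neighbors clockwise around the ring.
  (1 to math.min(maxMonitored, ring.size - 1)).map(k => ring((i + k) % ring.size))
}
```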

Heartbeats are sent out every second and every heartbeat is performed in a request/reply
handshake with the replies used as input to the failure detector.

The failure detector will also detect if the node becomes `reachable` again. When
all nodes that monitored the `unreachable` node detect it as `reachable` again
the cluster, after gossip dissemination, will consider it `reachable`.

If system messages cannot be delivered to a node it will be quarantined and then it
cannot come back from `unreachable`. This can happen if there are too many
unacknowledged system messages (e.g. watch, Terminated, remote actor deployment,
failures of actors supervised by a remote parent). Then the node needs to be moved
to the `down` or `removed` states (see the [Membership Lifecycle](#membership-lifecycle) section below)
and the actor system must be restarted before it can join the cluster again.

#### Leader

After gossip convergence a `leader` for the cluster can be determined. There is no
`leader` election process; the `leader` can always be recognised deterministically
by any node whenever there is gossip convergence. The leader is just a role; any node
can be the leader and it can change between convergence rounds.
The `leader` is simply the first node in sorted order that is able to take the leadership role,
where the preferred member states for a `leader` are `up` and `leaving`
(see the [Membership Lifecycle](#membership-lifecycle) section below for more information about member states).

The role of the `leader` is to shift members in and out of the cluster, changing
`joining` members to the `up` state or `exiting` members to the `removed`
state. Currently `leader` actions are only triggered by receiving a new cluster
state with gossip convergence.

The `leader` also has the power, if configured so, to "auto-down" a node that
according to the [Failure Detector](#failure-detector) is considered `unreachable`. This means setting
the `unreachable` node status to `down` automatically after a configured time
of unreachability.

#### Seed Nodes

The seed nodes are configured contact points for new nodes joining the cluster.
When a new node is started it sends a message to all seed nodes and then sends
a join command to the seed node that answers first.
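
Seed nodes are typically given in configuration; a minimal sketch (the node addresses are placeholders for a real deployment):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.parseString("""
  akka.cluster.seed-nodes = [
    "akka.tcp://ClusterSystem@host1:2552",
    "akka.tcp://ClusterSystem@host2:2552"
  ]
""").withFallback(ConfigFactory.load())

val system = ActorSystem("ClusterSystem", config)
```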

The seed nodes configuration value does not have any influence on the running
cluster itself; it is only relevant for new nodes joining the cluster as it
helps them to find contact points to send the join command to; a new member
can send this command to any current member of the cluster, not only to the seed nodes.

#### Gossip Protocol

A variation of *push-pull gossip* is used to reduce the amount of gossip
information sent around the cluster. In push-pull gossip a digest is sent
representing current versions but not actual values; the recipient of the gossip
can then send back any values for which it has newer versions and also request
values for which it has outdated versions. Akka uses a single shared state with
a vector clock for versioning, so the variant of push-pull gossip used in Akka
makes use of this version to only push the actual state as needed.

Periodically, by default every 1 second, each node chooses another random
node to initiate a round of gossip with. If less than ½ of the nodes reside in the
seen set (have seen the new state) then the cluster gossips 3 times instead of once
every second. This adjusted gossip interval is a way to speed up the convergence process
in the early dissemination phase after a state change.

The choice of node to gossip with is random but biased towards nodes that
might not have seen the current state version. During each round of gossip exchange, while
the cluster has not yet converged, a node uses a probability of 0.8 (configurable) to gossip to a node not
part of the seen set, i.e. one that probably has an older version of the state. Otherwise
it gossips to any random live node.

This biased selection is a way to speed up the convergence process in the late dissemination
phase after a state change.
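
A sketch of the biased choice (an assumed simplification of the actual selection logic):

```scala
import scala.util.Random

// With probability 0.8, gossip to a node that has not yet seen the
// current state version; otherwise pick any live node at random.
def chooseGossipTarget(liveNodes: Vector[String], seen: Set[String], rnd: Random): Option[String] = {
  val unseen = liveNodes.filterNot(seen.contains)
  val pool = if (unseen.nonEmpty && rnd.nextDouble() < 0.8) unseen else liveNodes
  if (pool.isEmpty) None else Some(pool(rnd.nextInt(pool.size)))
}
```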

For clusters larger than 400 nodes (configurable, and suggested by empirical evidence)
the 0.8 probability is gradually reduced to avoid overwhelming single stragglers with
too many concurrent gossip requests. The gossip receiver also has a mechanism to
protect itself from too many simultaneous gossip messages by dropping messages that
have been enqueued in the mailbox for too long.

While the cluster is in a converged state the gossiper only sends a small gossip status message containing the gossip
version to the chosen node. As soon as there is a change to the cluster (meaning non-convergence)
it goes back to biased gossip again.

The recipient of the gossip state or the gossip status can use the gossip version
(vector clock) to determine whether:

1. it has a newer version of the gossip state, in which case it sends that back
   to the gossiper
2. it has an outdated version of the state, in which case the recipient requests
   the current state from the gossiper by sending back its version of the gossip state
3. it has conflicting gossip versions, in which case the different versions are merged
   and sent back

If the recipient and the gossip have the same version then the gossip state is
not sent or requested.
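
The three outcomes above, plus the equal-version case, can be summarized as a dispatch on the vector clock comparison (names are illustrative, not Akka's internal API):

```scala
sealed trait VersionComparison
case object Newer extends VersionComparison      // local state is newer
case object Older extends VersionComparison      // local state is outdated
case object Concurrent extends VersionComparison // conflicting versions
case object Same extends VersionComparison       // identical versions

def onGossip(comparison: VersionComparison): String = comparison match {
  case Newer      => "send the local gossip state back to the gossiper"
  case Older      => "request the current state by sending back own version"
  case Concurrent => "merge the versions and send the result back"
  case Same       => "exchange nothing; no state is sent or requested"
}
```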

The periodic nature of the gossip has a nice batching effect of state changes,
e.g. joining several nodes in quick succession to one node will result in only
one state change being spread to the other members of the cluster.

The gossip messages are serialized with [protobuf](https://code.google.com/p/protobuf/) and also gzipped to reduce payload
size.

### Membership Lifecycle

A node begins in the `joining` state. Once all nodes have seen that the new
node is joining (through gossip convergence) the `leader` will set the member
state to `up`.

If a node is leaving the cluster in a safe, expected manner then it switches to
the `leaving` state. Once the leader sees convergence on the node in the
`leaving` state, the leader will then move it to `exiting`. Once all nodes
have seen the exiting state (convergence) the `leader` will remove the node
from the cluster, marking it as `removed`.

If a node is `unreachable` then gossip convergence is not possible and therefore
any `leader` actions are also not possible (for instance, allowing a node to
become a part of the cluster). To be able to move forward, the state of the
`unreachable` nodes must be changed. They must become `reachable` again or be marked
as `down`. If the node is to join the cluster again the actor system must be
restarted and go through the joining process again. The cluster can, through the
leader, also *auto-down* a node after a configured time of unreachability. If a new
incarnation of an unreachable node tries to rejoin the cluster, the old incarnation will be
marked as `down` and the new incarnation can rejoin the cluster without manual intervention.

@@@ note

If you have *auto-down* enabled and the failure detector triggers, you
can over time end up with a lot of single node clusters if you don't put
measures in place to shut down nodes that have become `unreachable`. This
follows from the fact that the `unreachable` node will likely see the rest of
the cluster as `unreachable`, become its own leader and form its own cluster.

@@@

As mentioned before, if a node is `unreachable` then gossip convergence is not
possible and therefore any `leader` actions are also not possible. By enabling
`akka.cluster.allow-weakly-up-members` (enabled by default) it is possible to
let new joining nodes be promoted while convergence is not yet reached. These
`Joining` nodes will be promoted to `WeaklyUp`. Once gossip convergence is
reached, the leader will move `WeaklyUp` members to `Up`.

Note that members on the other side of a network partition have no knowledge about
the existence of the new members. You should for example not count `WeaklyUp`
members in quorum decisions.

#### State Diagram for the Member States (`akka.cluster.allow-weakly-up-members=off`)

![member-states](./images/member-states.png)

#### State Diagram for the Member States (`akka.cluster.allow-weakly-up-members=on`)

![member-states-weakly-up](./images/member-states-weakly-up.png)

#### Member States

*
    **joining**
    : transient state when joining a cluster

*
    **weakly up**
    : transient state during a network split (only if `akka.cluster.allow-weakly-up-members=on`)

*
    **up**
    : normal operating state

*
    **leaving** / **exiting**
    : states during graceful removal

*
    **down**
    : marked as down (no longer part of cluster decisions)

*
    **removed**
    : tombstone state (no longer a member)

#### User Actions

*
    **join**
    : join a single node to a cluster - can be explicit or automatic on
    startup if a node to join has been specified in the configuration

*
    **leave**
    : tell a node to leave the cluster gracefully

*
    **down**
    : mark a node as down
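
These user actions correspond to methods on the `Cluster` extension; a minimal sketch (system name, hosts, and ports are placeholders):

```scala
import akka.actor.{ ActorSystem, Address }
import akka.cluster.Cluster

val system = ActorSystem("ClusterSystem")
val cluster = Cluster(system)

// join: point this node at any current member of the cluster
cluster.join(Address("akka.tcp", "ClusterSystem", "host1", 2552))

// leave: ask a node (here: this one) to leave the cluster gracefully
cluster.leave(cluster.selfAddress)

// down: mark an unreachable node as down
cluster.down(Address("akka.tcp", "ClusterSystem", "host2", 2552))
```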

#### Leader Actions

The `leader` has the following duties:

* shifting members in and out of the cluster
    * joining -> up
    * exiting -> removed

#### Failure Detection and Unreachability

*
    **fd***
    : the failure detector of one of the monitoring nodes has triggered
    causing the monitored node to be marked as `unreachable`

*
    **unreachable***
    : `unreachable` is not a real member state but more of a flag in addition
    to the state, signaling that the cluster is unable to talk to this node;
    after being `unreachable` the failure detector may detect it as `reachable`
    again and thereby remove the flag