add preprocessor for RST docs, see #2461 and #2431

The idea is to filter the sources, replacing @<var>@ occurrences with
the mapping for <var> (which is currently hard-coded); @@ maps to a
literal @. In order to make this work, I had to move the doc sources one
directory down (into akka-docs/rst), so that the filtered result can live
in a sibling directory and relative links (to _sphinx plugins or real
code) continue to work.

While I was at it I also changed it so that WARNINGs and ERRORs are no
longer swallowed into the debug dump but are printed at [warn] level
(at minimum).

One piece of fallout is that the (online) html build is now run after
the normal one, not in parallel.
Roland 2012-09-21 10:47:58 +02:00
parent c0f60da8cc
commit 9bc01ae265
266 changed files with 270 additions and 182 deletions


@@ -0,0 +1,430 @@
.. _cluster_usage:
###############
Cluster Usage
###############
.. note:: This module is :ref:`experimental <experimental>`. This document describes how to use the features implemented so far. More features are coming in Akka Coltrane. Track progress of the Coltrane milestone in `Assembla <http://www.assembla.com/spaces/akka/tickets>`_ and the `Roadmap <https://docs.google.com/document/d/18W9-fKs55wiFNjXL9q50PYOnR7-nnsImzJqHOPPbM4E/edit?hl=en_US>`_.
For an introduction to the Akka Cluster concepts please see :ref:`cluster`.
Preparing Your Project for Clustering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The Akka cluster is a separate jar file. Make sure that you have the following dependency in your project:
.. parsed-literal::
"com.typesafe.akka" %% "akka-cluster" % "@version@" @crossString@
If you are using the latest nightly build you should pick a timestamped Akka
version from
`<http://repo.typesafe.com/typesafe/snapshots/com/typesafe/akka/akka-cluster_@binVersion@/>`_.
We recommend against using ``SNAPSHOT``, so that your builds are stable.
A Simple Cluster Example
^^^^^^^^^^^^^^^^^^^^^^^^
The following small program together with its configuration starts an ``ActorSystem``
with the Cluster extension enabled. It joins the cluster and logs some membership events.
Try it out:
1. Add the following ``application.conf`` to your project and place it in ``src/main/resources`` (a minimal sketch of what such a configuration can look like is shown after these steps):
.. literalinclude:: ../../../akka-samples/akka-sample-cluster/src/main/resources/application.conf
   :language: none
To enable cluster capabilities in your Akka project you should, at a minimum, add the :ref:`remoting-scala`
settings, but with ``akka.cluster.ClusterActorRefProvider`` as the actor reference provider.
The ``akka.cluster.seed-nodes`` list and the other cluster settings should normally also be added to your
``application.conf`` file.
The seed nodes are configured contact points for the initial, automatic join of the cluster.
Note that if you are going to start the nodes on different machines you need to specify the
IP addresses or host names of the machines in ``application.conf`` instead of ``127.0.0.1``.
2. Add the following main program to your project and place it in ``src/main/scala``:
.. literalinclude:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/simple/SimpleClusterApp.scala
   :language: scala
3. Start the first seed node. Open a sbt session in one terminal window and run::
     run-main sample.cluster.simple.SimpleClusterApp 2551
2551 corresponds to the port of the first seed-nodes element in the configuration.
In the log output you see that the cluster node has been started and changed status to 'Up'.
4. Start the second seed node. Open a sbt session in another terminal window and run::
     run-main sample.cluster.simple.SimpleClusterApp 2552
2552 corresponds to the port of the second seed-nodes element in the configuration.
In the log output you see that the cluster node has been started and joins the other seed node,
becoming a member of the cluster. Its status changed to 'Up'.
Switch over to the first terminal window and see in the log output that the member joined.
5. Start another node. Open a sbt session in yet another terminal window and run::
     run-main sample.cluster.simple.SimpleClusterApp
Now you don't need to specify the port number, and it will use a random available port.
It joins one of the configured seed nodes. Look at the log output in the different terminal
windows.
Start even more nodes in the same way, if you like.
6. Shut down one of the nodes by pressing 'ctrl-c' in one of the terminal windows.
The other nodes will detect the failure after a while, which you can see in the log
output in the other terminals.
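As promised in step 1, here is a minimal sketch of what such an ``application.conf`` can
look like. This is a hedged sketch: the system name ``ClusterSystem``, host, and ports are
illustrative, and the sample file included in step 1 is the authoritative version::

  akka {
    actor {
      provider = "akka.cluster.ClusterActorRefProvider"
    }
    remote {
      netty {
        hostname = "127.0.0.1"
        port = 0
      }
    }
    cluster {
      seed-nodes = [
        "akka://ClusterSystem@127.0.0.1:2551",
        "akka://ClusterSystem@127.0.0.1:2552"]
    }
  }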
Look at the source code of the program again. What it does is create an actor
and register it as a subscriber of certain cluster events. It gets notified with
a snapshot event, ``CurrentClusterState``, that holds the full state information of
the cluster. After that it receives events for changes that happen in the cluster.
Automatic vs. Manual Joining
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You may decide whether joining the cluster should be done automatically or manually.
By default it is automatic, and you need to define the seed nodes in the configuration
so that a new node has an initial contact point. When a new node is started it
sends a message to all seed nodes and then sends a join command to the one that
answers first. If none of the seed nodes reply (they might not be started yet)
it retries this procedure until it succeeds or is shut down.
There is one thing to be aware of regarding the seed node configured as the
first element in the ``seed-nodes`` configuration list.
The seed nodes can be started in any order and it is not necessary to have all
seed nodes running, but the first seed node must be started when initially
starting a cluster, otherwise the other seed-nodes will not become initialized
and no other node can join the cluster. Once more than two seed nodes have been
started it is no problem to shut down the first seed node. If it goes down it
must be manually joined to the cluster again.
Automatic joining of the first seed node is not possible; it would only join
itself. It is only the first seed node that has this restriction.
You can disable automatic joining in the configuration::

  akka.cluster.auto-join = off
Then you need to join manually, using :ref:`cluster_jmx` or :ref:`cluster_command_line`.
You can join any node in the cluster. It doesn't have to be configured as a
seed node. If you are not using auto-join there is no need to configure
seed nodes at all.
Joining can also be performed programmatically with ``Cluster(system).join``.
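For example, a minimal sketch of a programmatic join (the system name and the target
address are illustrative)::

  import akka.actor.{ ActorSystem, Address }
  import akka.cluster.Cluster

  val system = ActorSystem("ClusterSystem")
  // join the cluster node running at the given address
  Cluster(system).join(Address("akka", "ClusterSystem", "127.0.0.1", 2551))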
Automatic vs. Manual Downing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When a member is considered by the failure detector to be unreachable the
leader is not allowed to perform its duties, such as changing status of
new joining members to 'Up'. The status of the unreachable member must be
changed to 'Down'. This can be performed automatically or manually. By
default it must be done manually, using :ref:`cluster_jmx` or
:ref:`cluster_command_line`.
It can also be performed programmatically with ``Cluster(system).down``.
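For example, a minimal sketch of programmatic downing (the address of the unreachable
member is illustrative)::

  import akka.actor.{ ActorSystem, Address }
  import akka.cluster.Cluster

  val system = ActorSystem("ClusterSystem")
  // mark the unreachable member as Down so that the leader can resume its duties
  Cluster(system).down(Address("akka", "ClusterSystem", "127.0.0.1", 2552))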
You can enable automatic downing in the configuration::

  akka.cluster.auto-down = on
Be aware that using auto-down implies that two separate clusters will
automatically be formed in case of a network partition. That might be
desired by some applications but not by others.
Subscribe to Cluster Events
^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can subscribe to change notifications of the cluster membership by using
``Cluster(system).subscribe``. A snapshot of the full state,
``akka.cluster.ClusterEvent.CurrentClusterState``, is sent to the subscriber
as the first event, followed by events for incremental updates.
There are several types of change events; consult the API documentation
of the classes that extend ``akka.cluster.ClusterEvent.ClusterDomainEvent``
for details about the events.
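A minimal sketch of what such a subscriber can look like (hedged; the actor and its
event handling are illustrative, and the samples below contain complete code)::

  import akka.actor.Actor
  import akka.cluster.Cluster
  import akka.cluster.ClusterEvent._

  class SimpleClusterListener extends Actor {
    val cluster = Cluster(context.system)

    // subscribe when the actor starts, unsubscribe when it stops
    override def preStart(): Unit = cluster.subscribe(self, classOf[ClusterDomainEvent])
    override def postStop(): Unit = cluster.unsubscribe(self)

    def receive = {
      case state: CurrentClusterState =>
        // first event: a snapshot of the full cluster state
        println("Current members: " + state.members)
      case MemberUp(member) =>
        // incremental update: a member changed status to Up
        println("Member is Up: " + member)
      case _: ClusterDomainEvent => // ignore other incremental updates
    }
  }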
Worker Dial-in Example
----------------------
Let's take a look at an example that illustrates how workers, here named *backend*,
can detect and register with new master nodes, here named *frontend*.
The example application provides a service to transform text. When some text
is sent to one of the frontend services, it will be delegated to one of the
backend workers, which performs the transformation job, and sends the result back to
the original client. New backend nodes, as well as new frontend nodes, can be
added to or removed from the cluster dynamically.
In this example the following imports are used:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#imports
Messages:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#messages
The backend worker that performs the transformation job:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#backend
Note that the ``TransformationBackend`` actor subscribes to cluster events to detect new,
potential, frontend nodes, and sends them a registration message so that they know
that they can use the backend worker.
The frontend that receives user jobs and delegates to one of the registered backend workers:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#frontend
Note that the ``TransformationFrontend`` actor watches the registered backend workers,
to be able to remove them from its list of available backend workers.
Death watch uses the cluster failure detector for nodes in the cluster, i.e. it detects
network failures and JVM crashes, in addition to graceful termination of a watched
actor.
This example is included in ``akka-samples/akka-sample-cluster``
and you can try it out by starting nodes in different terminal windows. For example, starting 2
frontend nodes and 3 backend nodes::

  sbt
  project akka-sample-cluster-experimental
  run-main sample.cluster.transformation.TransformationFrontend 2551
  run-main sample.cluster.transformation.TransformationBackend 2552
  run-main sample.cluster.transformation.TransformationBackend
  run-main sample.cluster.transformation.TransformationBackend
  run-main sample.cluster.transformation.TransformationFrontend
.. note:: The above example should probably be designed as two separate frontend/backend clusters, once there is a `cluster client for decoupling clusters <https://www.assembla.com/spaces/akka/tickets/1165>`_.
Cluster Aware Routers
^^^^^^^^^^^^^^^^^^^^^
All :ref:`routers <routing-scala>` can be made aware of member nodes in the cluster, i.e.
deploying new routees or looking up routees on nodes in the cluster.
When a node becomes unavailable or leaves the cluster the routees of that node are
automatically unregistered from the router. When new nodes join the cluster additional
routees are added to the router, according to the configuration.
When using a router with routees looked up on the cluster member nodes, i.e. the routees
are already running, the configuration for a router looks like this:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/multi-jvm/scala/sample/cluster/stats/StatsSampleSpec.scala#router-lookup-config
It's the relative actor path defined in ``routees-path`` that identifies which actor to look up.
``nr-of-instances`` defines the total number of routees in the cluster, but there will not be
more than one per node. Setting ``nr-of-instances`` to a high value will result in new routees
being added to the router when nodes join the cluster.
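One possible shape of such a deployment configuration (a hedged sketch; the deployment
path, router type, and values are illustrative, and the config included above is the
authoritative version)::

  akka.actor.deployment {
    /statsService/workerRouter {
      router = consistent-hashing
      nr-of-instances = 100
      cluster {
        enabled = on
        routees-path = "/user/statsWorker"
        allow-local-routees = on
      }
    }
  }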
The same type of router could also have been defined in code:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#router-lookup-in-code
When using a router with routees created and deployed on the cluster member nodes
the configuration for a router looks like this:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/multi-jvm/scala/sample/cluster/stats/StatsSampleSingleMasterSpec.scala#router-deploy-config
``nr-of-instances`` defines the total number of routees in the cluster, but the number of routees
per node, ``max-nr-of-instances-per-node``, will not be exceeded. Setting ``nr-of-instances``
to a high value will result in additional routees being created and deployed when new nodes join
the cluster.
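A hedged sketch of what the corresponding deployment configuration can look like (again
the path and values are illustrative; the config included above is authoritative)::

  akka.actor.deployment {
    /singleton/statsService/workerRouter {
      router = consistent-hashing
      nr-of-instances = 100
      cluster {
        enabled = on
        max-nr-of-instances-per-node = 3
        allow-local-routees = off
      }
    }
  }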
The same type of router could also have been defined in code:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#router-deploy-in-code
See the :ref:`cluster_configuration` section for further descriptions of the settings.
Router Example
--------------
Let's take a look at how to use cluster aware routers.
The example application provides a service to calculate statistics for a text.
When some text is sent to the service it splits it into words, and delegates the task
of counting the number of characters in each word to a separate worker, a routee of a router.
The character count for each word is sent back to an aggregator that calculates
the average number of characters per word when all results have been collected.
In this example we use the following imports:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#imports
Messages:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#messages
The worker that counts the number of characters in each word:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#worker
The service that receives text from users and splits it up into words, delegates to workers and aggregates:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#service
Note, nothing cluster specific so far, just plain actors.
We can use these actors with two different types of router setup: either with lookup of routees,
or with creation and deployment of routees. Remember, the routees are the workers in this case.
We start with the router setup with lookup of routees. All nodes start ``StatsService`` and
``StatsWorker`` actors and the router is configured with ``routees-path``:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#start-router-lookup
This means that user requests can be sent to ``StatsService`` on any node and it will use
``StatsWorker`` on all nodes. There can only be one worker per node, but that worker could easily
fan out to local children if more parallelism is needed.
This example is included in ``akka-samples/akka-sample-cluster``
and you can try it out by starting nodes in different terminal windows. For example, starting 3
service nodes and 1 client::

  run-main sample.cluster.stats.StatsSample 2551
  run-main sample.cluster.stats.StatsSample 2552
  run-main sample.cluster.stats.StatsSampleClient
  run-main sample.cluster.stats.StatsSample
The above setup is nice for this example, but we will also take a look at how to use
a single master node that creates and deploys workers. To keep track of a single
master we need one additional actor:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#facade
The ``StatsFacade`` receives text from users and delegates to the current ``StatsService``, the single
master. It listens to cluster events to create or look up the ``StatsService`` depending on whether
it is on the same node or on another node. We run the master on the same node as the leader of
the cluster members, which is nothing more than the address currently sorted first in the member ring,
i.e. it can change when new nodes join or when the current leader leaves.
All nodes start ``StatsFacade`` and the router is now configured like this:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#start-router-deploy
This example is included in ``akka-samples/akka-sample-cluster``
and you can try it out by starting nodes in different terminal windows. For example, starting 3
service nodes and 1 client::

  run-main sample.cluster.stats.StatsSampleOneMaster 2551
  run-main sample.cluster.stats.StatsSampleOneMaster 2552
  run-main sample.cluster.stats.StatsSampleOneMasterClient
  run-main sample.cluster.stats.StatsSampleOneMaster
.. note:: The above example, especially the last part, will be simplified when the cluster handles automatic actor partitioning.
.. _cluster_jmx:
JMX
^^^
Information and management of the cluster is available as JMX MBeans with the root name ``akka.Cluster``.
The JMX information can be displayed with an ordinary JMX console such as JConsole or JVisualVM.
From JMX you can:

* see which members are part of the cluster
* see the status of this node
* join this node to another node in the cluster
* mark any node in the cluster as down
* tell any node in the cluster to leave

Member nodes are identified by their address, in the format `akka://actor-system-name@hostname:port`.
.. _cluster_command_line:
Command Line Management
^^^^^^^^^^^^^^^^^^^^^^^
The cluster can be managed with the script `bin/akka-cluster` provided in the
Akka distribution.
Run it without parameters to see instructions about how to use the script::
  Usage: bin/akka-cluster <node-hostname:jmx-port> <command> ...

  Supported commands are:
    join <node-url>   - Sends a request to JOIN the node with the specified URL
    leave <node-url>  - Sends a request for the node with the specified URL to LEAVE the cluster
    down <node-url>   - Sends a request for marking the node with the specified URL as DOWN
    member-status     - Asks the member node for its current status
    cluster-status    - Asks the cluster for its current status (member ring,
                        unavailable nodes, meta data etc.)
    leader            - Asks the cluster who the current leader is
    is-singleton      - Checks if the cluster is a singleton cluster (single
                        node cluster)
    is-available      - Checks if the member node is available
    is-running        - Checks if the member node is running
    has-convergence   - Checks if there is a cluster convergence

  Where <node-url> should be in the format 'akka://actor-system-name@hostname:port'

  Examples: bin/akka-cluster localhost:9999 is-available
            bin/akka-cluster localhost:9999 join akka://MySystem@darkstar:2552
            bin/akka-cluster localhost:9999 cluster-status
To be able to use the script you must enable remote monitoring and management when starting the JVMs of the cluster nodes,
as described in `Monitoring and Management Using JMX Technology <http://docs.oracle.com/javase/6/docs/technotes/guides/management/agent.html>`_.

Example of system properties to enable remote monitoring and management::

  java -Dcom.sun.management.jmxremote.port=9999 \
       -Dcom.sun.management.jmxremote.authenticate=false \
       -Dcom.sun.management.jmxremote.ssl=false
.. _cluster_configuration:
Configuration
^^^^^^^^^^^^^
There are several configuration properties for the cluster. We refer to the following
reference file for more information:
.. literalinclude:: ../../../akka-cluster/src/main/resources/reference.conf
   :language: none
Cluster Scheduler
-----------------
When using the cluster it is recommended that you change the ``tick-duration``
of the default scheduler to 33 ms or less, unless you need it configured to
a longer duration for other reasons. If you don't do this, a dedicated
scheduler will be used for the periodic tasks of the cluster, which
introduces the extra overhead of another thread.

::

  # shorter tick-duration of default scheduler when using cluster
  akka.scheduler.tick-duration = 33ms


@@ -0,0 +1,644 @@
.. _cluster:
######################
Cluster Specification
######################
.. note:: This module is :ref:`experimental <experimental>`. This document describes the design concepts of the new clustering coming in Akka Coltrane. Not everything described here is implemented yet.
Intro
=====
Akka Cluster provides a fault-tolerant, elastic, decentralized peer-to-peer
cluster with no single point of failure (SPOF) or single point of bottleneck
(SPOB). It implements a Dynamo-style system using gossip protocols, automatic
failure detection, automatic partitioning, handoff, and cluster rebalancing, but
with some differences due to the fact that it is not just managing passive data,
but actors: active, sometimes stateful, components that also have requirements
on message ordering, the number of active instances in the cluster, etc.
Terms
=====
These terms are used throughout the documentation.
**node**
A logical member of a cluster. There could be multiple nodes on a physical
machine. Defined by a `hostname:port` tuple.
**cluster**
A set of nodes. Contains distributed Akka applications.
**partition**
An actor or subtree of actors in the Akka application that is distributed
within the cluster.
**partition point**
The actor at the head of a partition. The point around which a partition is
formed.
**partition path**
Also referred to as the actor address. Has the format `actor1/actor2/actor3`.
**instance count**
The number of instances of a partition in the cluster. Also referred to as the
``N-value`` of the partition.
**instance node**
A node that an actor instance is assigned to.
**partition table**
A mapping from partition path to a set of instance nodes (where the nodes are
referred to by their ordinal position, given the nodes in sorted order).
**leader**
A single node in the cluster that acts as the leader, managing cluster convergence,
partitions, fail-over, rebalancing etc.
Membership
==========
A cluster is made up of a set of member nodes. The identifier for each node is a
``hostname:port`` pair. An Akka application is distributed over a cluster with
each node hosting some part of the application. Cluster membership and
partitioning of the application are decoupled. A node could be a member of a
cluster without hosting any actors.
Singleton Cluster
-----------------
If a node does not have a preconfigured contact point to join in the Akka
configuration, then it is considered a singleton cluster (single node cluster)
and will automatically transition from ``joining`` to ``up``. Singleton clusters
can later explicitly send a ``Join`` message to another node to form an N-node
cluster. It is also possible to link multiple N-node clusters by ``joining`` them.
Gossip
------
The cluster membership used in Akka is based on Amazon's `Dynamo`_ system and
particularly the approach taken in Basho's `Riak`_ distributed database.
Cluster membership is communicated using a `Gossip Protocol`_, where the current
state of the cluster is gossiped randomly through the cluster. Joining a cluster
is initiated by issuing a ``Join`` command to one of the nodes in the cluster to
join.
.. _Gossip Protocol: http://en.wikipedia.org/wiki/Gossip_protocol
.. _Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
.. _Riak: http://basho.com/technology/architecture/
Vector Clocks
^^^^^^^^^^^^^
`Vector clocks`_ are an algorithm for generating a partial ordering of events in
a distributed system and detecting causality violations.
We use vector clocks to reconcile and merge differences in cluster state
during gossiping. A vector clock is a set of (node, counter) pairs. Each update
to the cluster state has an accompanying update to the vector clock.
One problem with vector clocks is that their history can become very long over
time, which both makes comparisons take longer and takes up unnecessary memory.
To solve that problem we prune the vector clocks according to
the `pruning algorithm`_ in Riak.
.. _Vector Clocks: http://en.wikipedia.org/wiki/Vector_clock
.. _pruning algorithm: http://wiki.basho.com/Vector-Clocks.html#Vector-Clock-Pruning
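To make this concrete, here is a minimal sketch of the comparison and merge
operations of a vector clock (illustrative only, not the actual Akka implementation)::

  object VectorClockSketch {
    type Node = String

    case class VectorClock(versions: Map[Node, Long] = Map.empty) {
      // record an update made by `node`
      def +(node: Node): VectorClock =
        copy(versions = versions.updated(node, versions.getOrElse(node, 0L) + 1L))

      // true if every entry here is <= the corresponding entry in `other`,
      // i.e. this state causally precedes (or equals) the other state
      def <=(other: VectorClock): Boolean =
        versions.forall { case (n, v) => v <= other.versions.getOrElse(n, 0L) }

      // pairwise maximum, used when reconciling divergent gossiped states
      def merge(other: VectorClock): VectorClock =
        VectorClock((versions.keySet ++ other.versions.keySet).map { n =>
          n -> math.max(versions.getOrElse(n, 0L), other.versions.getOrElse(n, 0L))
        }.toMap)
    }
  }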
Gossip Convergence
^^^^^^^^^^^^^^^^^^
Information about the cluster converges at certain points of time. This is when
all nodes have seen the same cluster state. Convergence is recognised by passing
a map from node to current state version during gossip. This information is
referred to as the gossip overview. When all versions in the overview are equal
there is convergence. Gossip convergence cannot occur while any nodes are
unreachable; either the nodes must become reachable again, or they must be
moved into the ``down`` or ``removed`` states (see the section on `Member states`_
below).
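A minimal sketch of the convergence check (illustrative only)::

  // convergence: no unreachable nodes, and every node has seen the same state version
  def converged[Node, Version](versions: Map[Node, Version], unreachable: Set[Node]): Boolean =
    unreachable.isEmpty && versions.values.toSet.size <= 1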
Failure Detector
^^^^^^^^^^^^^^^^
The failure detector is responsible for trying to detect if a node is
unreachable from the rest of the cluster. For this we are using an
implementation of `The Phi Accrual Failure Detector`_ by Hayashibara et al.
An accrual failure detector decouples monitoring and interpretation. That makes
accrual failure detectors applicable to a wider range of scenarios and more
adequate for building generic failure detection services. The idea is to keep a
history of failure statistics, calculated from heartbeats received from other
nodes, and to take multiple factors, and how they accumulate over time, into
account in order to come up with a better guess about whether a specific node
is up or down. Rather than just answering "yes" or "no" to the
question "is the node down?" it returns a ``phi`` value representing the
likelihood that the node is down.
The ``threshold`` that is the basis for the calculation is configurable by the
user. A low ``threshold`` is prone to generate many wrong suspicions but ensures
a quick detection in the event of a real crash. Conversely, a high ``threshold``
generates fewer mistakes but needs more time to detect actual crashes. The
default ``threshold`` is 8 and is appropriate for most situations. However in
cloud environments, such as Amazon EC2, the value could be increased to 12 in
order to account for network issues that sometimes occur on such platforms.
.. _The Phi Accrual Failure Detector: http://ddg.jaist.ac.jp/pub/HDY+04.pdf
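The following sketch conveys the idea of the calculation, assuming normally
distributed heartbeat inter-arrival times and using a logistic approximation of
the normal CDF (illustrative only, not the actual Akka implementation)::

  // phi = -log10(probability that a heartbeat arrives even later than now),
  // where mean and stdDev are estimated from the observed heartbeat history
  def phi(timeSinceLastHeartbeat: Double, mean: Double, stdDev: Double): Double = {
    val y = (timeSinceLastHeartbeat - mean) / stdDev
    val cdf = 1.0 / (1.0 + math.exp(-y * (1.5976 + 0.070566 * y * y)))
    -math.log10(1.0 - cdf)
  }

The longer the time since the last heartbeat, relative to the observed history,
the larger phi grows; the node is suspected when phi exceeds the configured
``threshold``.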
Leader
^^^^^^
After gossip convergence a ``leader`` for the cluster can be determined. There is no
``leader`` election process; the ``leader`` can always be recognised deterministically
by any node whenever there is gossip convergence. The ``leader`` is simply the first
node in sorted order that is able to take the leadership role, where the only
allowed member states for a ``leader`` are ``up``, ``leaving`` or ``exiting`` (see
below for more information about member states).
The role of the ``leader`` is to shift members in and out of the cluster, changing
``joining`` members to the ``up`` state or ``exiting`` members to the
``removed`` state, and to schedule rebalancing across the cluster. Currently
``leader`` actions are only triggered by receiving a new cluster state with gossip
convergence but it may also be possible for the user to explicitly rebalance the
cluster by specifying migrations, or to rebalance the cluster automatically
based on metrics from member nodes. Metrics may be spread using the gossip
protocol or possibly more efficiently using a *random chord* method, where the
``leader`` contacts several random nodes around the cluster ring and each contacted
node gathers information from their immediate neighbours, giving a random
sampling of load information.
The ``leader`` also has the power, if configured so, to "auto-down" a node that
according to the Failure Detector is considered unreachable. This means setting
the unreachable node status to ``down`` automatically.
Seed Nodes
^^^^^^^^^^
The seed nodes are configured contact points for the initial join of the cluster.
When a new node is started it sends a message to all seed nodes and
then sends a join command to the one that answers first.
It is possible to turn off automatic join.
Gossip Protocol
^^^^^^^^^^^^^^^
A variation of *push-pull gossip* is used to reduce the amount of gossip
information sent around the cluster. In push-pull gossip a digest is sent
representing current versions but not actual values; the recipient of the gossip
can then send back any values for which it has newer versions and also request
values for which it has outdated versions. Akka uses a single shared state with
a vector clock for versioning, so the variant of push-pull gossip used in Akka
makes use of the gossip overview (containing the current state versions for all
nodes) to only push the actual state as needed. This also allows any node to
easily determine which other nodes have newer or older information, not just the
nodes involved in a gossip exchange.
Periodically, by default every second, each node chooses another random
node to initiate a round of gossip with. The choice of node is random but can
also include extra gossiping nodes with either newer or older state versions.
The gossip overview contains the current state version for all nodes and also a
list of unreachable nodes. Whenever a node receives a gossip overview it updates
the `Failure Detector`_ with the liveness information.
The nodes defined as ``seed`` nodes are just regular member nodes whose only
"special role" is to function as contact points in the cluster.
During each round of gossip exchange a node gossips, with some probability, to a
random node with newer or older state information, if any, based on the current
gossip overview. Otherwise it gossips to any random live node.
The gossiper only sends the gossip overview to the chosen node. The recipient of
the gossip can use the gossip overview to determine whether:
1. it has a newer version of the gossip state, in which case it sends that back
to the gossiper, or
2. it has an outdated version of the state, in which case the recipient requests
the current state from the gossiper
If the recipient has the same version as the gossip overview then the gossip
state is not sent or requested.
The main structures used in gossiping are the gossip overview and the gossip
state::
  GossipOverview {
    versions: Map[Node, VectorClock],
    unreachable: Set[Node]
  }

  GossipState {
    version: VectorClock,
    members: SortedSet[Member],
    partitions: Tree[PartitionPath, Node],
    pending: Set[PartitionChange],
    meta: Option[Map[String, Array[Byte]]]
  }
Some of the other structures used are::
  Node = InetSocketAddress

  Member {
    node: Node,
    state: MemberState
  }

  MemberState = Joining | Up | Leaving | Exiting | Down | Removed

  PartitionChange {
    from: Node,
    to: Node,
    path: PartitionPath,
    status: PartitionChangeStatus
  }

  PartitionChangeStatus = Awaiting | Complete
Membership Lifecycle
--------------------
A node begins in the ``joining`` state. Once all nodes have seen that the new
node is joining (through gossip convergence) the ``leader`` will set the member
state to ``up`` and can start assigning partitions to the new node.
If a node is leaving the cluster in a safe, expected manner then it switches to
the ``leaving`` state. The ``leader`` will reassign partitions across the cluster
(it is possible for a leaving node to itself be the ``leader``). When all partition
handoff has completed then the node will change to the ``exiting`` state. Once
all nodes have seen the exiting state (convergence) the ``leader`` will remove the
node from the cluster, marking it as ``removed``.
If a node is unreachable then gossip convergence is not possible and therefore
any ``leader`` actions are also not possible (for instance, allowing a node to
become a part of the cluster, or changing actor distribution). To be able to
move forward the state of the unreachable nodes must be changed. If the
unreachable node is experiencing only transient difficulties then it can be
explicitly marked as ``down`` using the ``down`` user action. When this node
comes back up and begins gossiping it will automatically go through the joining
process again. If the unreachable node will be permanently down then it can be
removed from the cluster directly by shutting the actor system down or killing it
through an external ``SIGKILL`` signal, invocation of ``System.exit(status)`` or
similar. The cluster can, through the leader, also *auto-down* a node.
This means that nodes can join and leave the cluster at any point in time, i.e.
provide cluster elasticity.
State Diagram for the Member States
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. image:: images/member-states.png
Member States
^^^^^^^^^^^^^
- **joining**
transient state when joining a cluster
- **up**
normal operating state
- **leaving** / **exiting**
states during graceful removal
- **down**
marked as down/offline/unreachable
- **removed**
tombstone state (no longer a member)
User Actions
^^^^^^^^^^^^
- **join**
join a single node to a cluster - can be explicit or automatic on
startup if a node to join has been specified in the configuration
- **leave**
tell a node to leave the cluster gracefully
- **down**
mark a node as temporarily down
Leader Actions
^^^^^^^^^^^^^^
The ``leader`` has the following duties:
- shifting members in and out of the cluster
- joining -> up
- exiting -> removed
- partition distribution
- scheduling handoffs (pending changes)
- setting the partition table (partition path -> base node)
- automatic rebalancing based on runtime metrics in the system (such as CPU,
  RAM, garbage collection, mailbox depth etc.)
Partitioning
============
Each partition (an actor or actor subtree) in the actor system is assigned to a
set of nodes in the cluster. The actor at the head of the partition is referred
to as the partition point. The mapping from partition path (actor address of the
format "a/b/c") to instance nodes is stored in the partition table and is
maintained as part of the cluster state through the gossip protocol. The
partition table is only updated by the ``leader`` node. Currently the only possible
partition points are *routed* actors.
Routed actors can have an instance count greater than one. The instance count is
also referred to as the ``N-value``. If the ``N-value`` is greater than one then
a set of instance nodes will be given in the partition table.
Note that in the first implementation there may be a restriction such that only
top-level partitions are possible (the highest possible partition points are
used and sub-partitioning is not allowed). This is still to be explored in more detail.
The cluster ``leader`` determines the current instance count for a partition based
on two axes: fault-tolerance and scaling.
Fault-tolerance determines a minimum number of instances for a routed actor
(allowing N-1 nodes to crash while still maintaining at least one running actor
instance). The user can specify a function from the current number of nodes to the
number of acceptable node failures: ``n: Int => f: Int`` where ``f < n``.
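For example, such a function could look like this (illustrative only)::

  // tolerate roughly half of the nodes failing, but always fewer than n
  val acceptableFailures: Int => Int = n => math.max(0, (n - 1) / 2)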
Scaling reflects the number of instances needed to maintain good throughput and
is influenced by metrics from the system, particularly a history of mailbox
size, CPU load, and GC percentages. It may also be possible to accept scaling
hints from the user that indicate expected load.
The balancing of partitions can be determined in a very simple way in the first
implementation, where the overlap of partitions is minimized. Partitions are
spread over the cluster ring in a circular fashion, with each instance node in
the first available space. For example, given a cluster with ten nodes and three
partitions, A, B, and C, having N-values of 4, 3, and 5; partition A would have
instances on nodes 1-4; partition B would have instances on nodes 5-7; partition
C would have instances on nodes 8-10 and 1-2. The only overlap is on nodes 1 and
2.
The distribution of partitions is not limited, however, to having instances on
adjacent nodes in the sorted ring order. Each instance can be assigned to any
node and the more advanced load balancing algorithms will make use of this. The
partition table contains a mapping from path to instance nodes. The partitioning
for the above example would be::
  A -> { 1, 2, 3, 4 }
  B -> { 5, 6, 7 }
  C -> { 8, 9, 10, 1, 2 }
If 5 new nodes join the cluster and in sorted order these nodes appear after the
current nodes 2, 4, 5, 7, and 8, then the partition table could be updated to
the following, with all instances on the same physical nodes as before::
  A -> { 1, 2, 4, 5 }
  B -> { 7, 9, 10 }
  C -> { 12, 14, 15, 1, 2 }
When rebalancing is required the ``leader`` will schedule handoffs, gossiping a set
of pending changes, and when each change is complete the ``leader`` will update the
partition table.
Handoff
-------
Handoff for an actor-based system is different from that for a data-based system. The
most important point is that message ordering (from a given node to a given
actor instance) may need to be maintained. If an actor is a singleton actor
(only one instance possible throughout the cluster) then the cluster may also
need to assure that there is only one such actor active at any one time. Both of
these situations can be handled by forwarding and buffering messages during
transitions.
A *graceful handoff* (one where the previous host node is up and running during
the handoff), given a previous host node ``N1``, a new host node ``N2``, and an
actor partition ``A`` to be migrated from ``N1`` to ``N2``, has this general
structure:
1. the ``leader`` sets a pending change for ``N1`` to handoff ``A`` to ``N2``
2. ``N1`` notices the pending change and sends an initialization message to ``N2``
3. in response ``N2`` creates ``A`` and sends back a ready message
4. after receiving the ready message ``N1`` marks the change as
complete and shuts down ``A``
5. the ``leader`` sees the migration is complete and updates the partition table
6. all nodes eventually see the new partitioning and use ``N2``
Transitions
^^^^^^^^^^^
There are transition times in the handoff process where different approaches can
be used to give different guarantees.
Migration Transition
~~~~~~~~~~~~~~~~~~~~
The first transition starts when ``N1`` initiates the moving of ``A`` and ends
when ``N1`` receives the ready message, and is referred to as the *migration
transition*.
The first question is: during the migration transition, should

- ``N1`` continue to process messages for ``A``, or
- is it important that no messages for ``A`` are processed on
  ``N1`` once migration begins?
If it is okay for the previous host node ``N1`` to process messages during
migration then there is nothing that needs to be done at this point.
If no messages are to be processed on the previous host node during migration
then there are two possibilities: the messages are forwarded to the new host and
buffered until the actor is ready, or the messages are simply dropped by
terminating the actor and allowing the normal dead letter process to be used.
Update Transition
~~~~~~~~~~~~~~~~~
The second transition begins when the migration is marked as complete and ends
when all nodes have the updated partition table (when all nodes will use ``N2``
as the host for ``A``, i.e. we have convergence) and is referred to as the
*update transition*.
Once the update transition begins ``N1`` can forward any messages it receives
for ``A`` to the new host ``N2``. The question is whether or not message
ordering needs to be preserved. If messages sent to the previous host node
``N1`` are being forwarded, then it is possible that a message sent to ``N1``
could be forwarded after a direct message to the new host ``N2``, breaking
message ordering from a client to actor ``A``.
In this situation ``N2`` can keep a buffer for messages per sending node. Each
buffer is flushed and removed when an acknowledgement (``ack``) message has been
received. When each node in the cluster sees the partition update it first sends
an ``ack`` message to the previous host node ``N1`` before beginning to use
``N2`` as the new host for ``A``. Any messages sent from the client node
directly to ``N2`` will be buffered. ``N1`` can count down the number of acks to
determine when no more forwarding is needed. The ``ack`` message from any node
will always follow any other messages sent to ``N1``. When ``N1`` receives the
``ack`` message it also forwards it to ``N2`` and again this ``ack`` message
will follow any other messages already forwarded for ``A``. When ``N2`` receives
an ``ack`` message, the buffer for the sending node can be flushed and removed.
Any subsequent messages from this sending node can be queued normally. Once all
nodes in the cluster have acknowledged the partition change and ``N2`` has
cleared all buffers, the handoff is complete and message ordering has been
preserved. In practice the buffers should remain small as it is only those
messages sent directly to ``N2`` before the acknowledgement has been forwarded
that will be buffered.
Graceful Handoff
^^^^^^^^^^^^^^^^
A more complete process for graceful handoff would be:
1. the ``leader`` sets a pending change for ``N1`` to handoff ``A`` to ``N2``
2. ``N1`` notices the pending change and sends an initialization message to
``N2``. Options:
a. keep ``A`` on ``N1`` active and continue processing messages as normal
b. ``N1`` forwards all messages for ``A`` to ``N2``
c. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
becoming dead letters)
3. in response ``N2`` creates ``A`` and sends back a ready message. Options:
a. ``N2`` simply processes messages for ``A`` as normal
b. ``N2`` creates a buffer per sending node for ``A``. Each buffer is
opened (flushed and removed) when an acknowledgement for the sending
node has been received (via ``N1``)
4. after receiving the ready message ``N1`` marks the change as complete. Options:
a. ``N1`` forwards all messages for ``A`` to ``N2`` during the update transition
b. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
becoming dead letters)
5. the ``leader`` sees the migration is complete and updates the partition table
6. all nodes eventually see the new partitioning and use ``N2``
i. each node sends an acknowledgement message to ``N1``
ii. when ``N1`` receives the acknowledgement it can count down the pending
acknowledgements and remove forwarding when complete
iii. when ``N2`` receives the acknowledgement it can open the buffer for the
sending node (if buffers are used)
The default approach is to take options 2a, 3a, and 4a - allowing ``A`` on
``N1`` to continue processing messages during migration and then forwarding any
messages during the update transition. This assumes stateless actors that do not
have a dependency on message ordering from any given source.
- If an actor has a distributed durable mailbox then nothing needs to be done,
other than migrating the actor.
- If message ordering needs to be maintained during the update transition then
option 3b can be used, creating buffers per sending node.
- If the actors are robust to message send failures then the dropping messages
approach can be used (with no forwarding or buffering needed).
- If an actor is a singleton (only one instance possible throughout the cluster)
and state is transferred during the migration initialization, then options 2b
and 3b would be required.
Stateful Actor Replication
==========================
Support for stateful singleton actors will come in future releases of Akka, and
is scheduled for Akka 2.2. Since we already have a Dynamo base for the clustering,
we should use the same infrastructure to provide stateful actor clustering and a
datastore as well. The stateful actor clustering should be layered on top of the
distributed datastore. See the next section for a rough outline of how the
distributed datastore could be implemented.
Implementing a Dynamo-style Distributed Database on top of Akka Cluster
-----------------------------------------------------------------------
The missing pieces to implement a full Dynamo-style eventually consistent data
storage on top of the Akka Cluster as described in this document are:
- Configuration of ``READ`` and ``WRITE`` consistency levels according to the
  ``N/R/W`` numbers defined in the Dynamo paper (a worked example follows this list):

  - R = read replica count
  - W = write replica count
  - N = replication factor
  - Q = QUORUM = N / 2 + 1
  - W + R > N gives full consistency
- Define a versioned data message wrapper::
    Versioned[T](hash: Long, version: VectorClock, data: T)
- Define a single system data broker actor on each node that uses a ``Consistent
  Hashing Router`` and that has instances on all other nodes in the node ring.
- For ``WRITE``:
1. Wrap data in a ``Versioned Message``
2. Send the ``Versioned Message`` with the data to a number of nodes
   matching the ``W-value``.
- For ``READ``:
1. Read in the ``Versioned Message`` with the data from as many replicas as
you need for the consistency level required by the ``R-value``.
2. Compare the versions (using `Vector Clocks`_).
3. If the versions differ then do `Read Repair`_ to update the inconsistent
nodes.
4. Return the latest versioned data.
.. _Read Repair: http://wiki.apache.org/cassandra/ReadRepair
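As the worked example promised above: with replication factor ``N = 3`` the quorum is
``Q = 3/2 + 1 = 2`` (integer division). Choosing ``W = 2`` and ``R = 2`` gives
``W + R = 4 > N = 3``, so every read quorum overlaps every write quorum in at least
one replica, and a read is guaranteed to observe the latest successful write.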

(Two binary image files added, 38 KiB and 1.5 KiB; not shown.)


@@ -0,0 +1,8 @@
Cluster
=======
.. toctree::
   :maxdepth: 2

   cluster
   cluster-usage