The idea is to filter the doc sources, replacing @<var>@ occurrences with the mapping for <var> (which is currently hard-coded), and @@ with a literal @. In order to make this work, I had to move the doc sources one directory down (into akka-docs/rst) so that the filtered result can live in a sibling directory and relative links (to _sphinx plugins or real code) continue to work. While I was at it I also changed it so that WARNINGs and ERRORs are no longer swallowed into the debug dump but printed at [warn] level (at minimum). One piece of fallout is that the (online) html build is now run after the normal one, not in parallel.
parent c0f60da8cc
commit 9bc01ae265
266 changed files with 270 additions and 182 deletions
akka-docs/rst/cluster/cluster-usage.rst (new file, 430 lines added)
@@ -0,0 +1,430 @@
.. _cluster_usage:
|
||||
|
||||
###############
|
||||
Cluster Usage
|
||||
###############
|
||||
|
||||
.. note:: This module is :ref:`experimental <experimental>`. This document describes how to use the features implemented so far. More features are coming in Akka Coltrane. Track progress of the Coltrane milestone in `Assembla <http://www.assembla.com/spaces/akka/tickets>`_ and the `Roadmap <https://docs.google.com/document/d/18W9-fKs55wiFNjXL9q50PYOnR7-nnsImzJqHOPPbM4E/edit?hl=en_US>`_.
|
||||
|
||||
For an introduction to the Akka Cluster concepts please see :ref:`cluster`.
|
||||
|
||||
Preparing Your Project for Clustering
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Akka Cluster is provided as a separate jar file. Make sure that you have the following dependency in your project:
|
||||
|
||||
.. parsed-literal::
|
||||
|
||||
"com.typesafe.akka" %% "akka-cluster" % "@version@" @crossString@
|
||||
|
||||
If you are using the latest nightly build you should pick a timestamped Akka
version from
`<http://repo.typesafe.com/typesafe/snapshots/com/typesafe/akka/akka-cluster_@binVersion@/>`_.
We recommend using such a timestamped version rather than ``SNAPSHOT`` in order to obtain stable, repeatable builds.
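For example, in a plain sbt build this could look like the following (the
version number below is only illustrative; use the release you actually
depend on)::

  resolvers += "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/"

  libraryDependencies += "com.typesafe.akka" %% "akka-cluster" % "2.1.0"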
|
||||
|
||||
A Simple Cluster Example
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The following small program together with its configuration starts an ``ActorSystem``
|
||||
with the Cluster extension enabled. It joins the cluster and logs some membership events.
|
||||
|
||||
Try it out:
|
||||
|
||||
1. Add the following ``application.conf`` to your project and place it in ``src/main/resources``:
|
||||
|
||||
|
||||
.. literalinclude:: ../../../akka-samples/akka-sample-cluster/src/main/resources/application.conf
|
||||
:language: none
|
||||
|
||||
To enable cluster capabilities in your Akka project you should, at a minimum, add the :ref:`remoting-scala`
settings, but use ``akka.cluster.ClusterActorRefProvider`` as the actor ref provider.
The ``akka.cluster.seed-nodes`` setting and the cluster extension should normally also be added to your
``application.conf`` file.
|
||||
|
||||
The seed nodes are configured contact points for the initial, automatic join of the cluster.
|
||||
|
||||
Note that if you are going to start the nodes on different machines you need to specify the
IP addresses or host names of the machines in ``application.conf`` instead of ``127.0.0.1``.
|
||||
|
||||
2. Add the following main program to your project, place it in ``src/main/scala``:
|
||||
|
||||
.. literalinclude:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/simple/SimpleClusterApp.scala
|
||||
:language: scala
|
||||
|
||||
|
||||
3. Start the first seed node. Open a sbt session in one terminal window and run::
|
||||
|
||||
run-main sample.cluster.simple.SimpleClusterApp 2551
|
||||
|
||||
2551 corresponds to the port of the first seed-nodes element in the configuration.
|
||||
In the log output you see that the cluster node has been started and changed status to 'Up'.
|
||||
|
||||
4. Start the second seed node. Open a sbt session in another terminal window and run::
|
||||
|
||||
run-main sample.cluster.simple.SimpleClusterApp 2552
|
||||
|
||||
|
||||
2552 corresponds to the port of the second seed-nodes element in the configuration.
In the log output you see that the cluster node has been started and joins the other seed node
and becomes a member of the cluster. Its status changes to 'Up'.
|
||||
|
||||
Switch over to the first terminal window and see in the log output that the member joined.
|
||||
|
||||
5. Start another node. Open a sbt session in yet another terminal window and run::
|
||||
|
||||
run-main sample.cluster.simple.SimpleClusterApp
|
||||
|
||||
Now you don't need to specify the port number, and it will use a random available port.
|
||||
It joins one of the configured seed nodes. Look at the log output in the different terminal
|
||||
windows.
|
||||
|
||||
Start even more nodes in the same way, if you like.
|
||||
|
||||
6. Shut down one of the nodes by pressing 'ctrl-c' in one of the terminal windows.
|
||||
The other nodes will detect the failure after a while, which you can see in the log
|
||||
output in the other terminals.
|
||||
|
||||
Look at the source code of the program again. It creates an actor
and registers it as a subscriber of certain cluster events. It gets notified with
a snapshot event, ``CurrentClusterState``, that holds the full state information of
the cluster. After that it receives events for changes that happen in the cluster.
|
||||
|
||||
Automatic vs. Manual Joining
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
You may decide whether joining the cluster should be done automatically or manually.
By default it is automatic and you need to define the seed nodes in the configuration
so that a new node has an initial contact point. When a new node is started it
sends a message to all seed nodes and then sends a join command to the one that
answers first. If none of the seed nodes replies (they might not be started yet)
it retries this procedure until it succeeds or the node is shut down.
|
||||
|
||||
There is one thing to be aware of regarding the seed node configured as the
first element in the ``seed-nodes`` configuration list.
The seed nodes can be started in any order and it is not necessary to have all
seed nodes running, but the node configured as the first seed node must be started
when initially starting a cluster, otherwise the other seed nodes will not become
initialized and no other node can join the cluster. Once more than two seed nodes
have been started it is no problem to shut down the first seed node. If it goes
down it must be joined to the cluster again manually; automatic joining of the
first seed node is not possible, since it would only join itself. It is only the
first seed node that has this restriction.
|
||||
|
||||
You can disable automatic joining with configuration::

  akka.cluster.auto-join = off
|
||||
|
||||
Then you need to join manually, using :ref:`cluster_jmx` or :ref:`cluster_command_line`.
You can join to any node in the cluster; it doesn't have to be configured as
a seed node. If you are not using auto-join there is no need to configure
seed nodes at all.
|
||||
|
||||
Joining can also be performed programmatically with ``Cluster(system).join``.
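A minimal sketch of such a programmatic join, assuming the actor system is
named ``ClusterSystem`` and that a cluster node is already listening on the
given address (both values are only illustrative)::

  import akka.actor.{ ActorSystem, Address }
  import akka.cluster.Cluster

  val system = ActorSystem("ClusterSystem")
  // address of any node that is already a member of the cluster
  val joinAddress = Address("akka", "ClusterSystem", "127.0.0.1", 2551)
  Cluster(system).join(joinAddress)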
|
||||
|
||||
|
||||
Automatic vs. Manual Downing
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
When a member is considered by the failure detector to be unreachable the
leader is not allowed to perform its duties, such as changing the status of
new joining members to 'Up'. The status of the unreachable member must be
changed to 'Down'. This can be performed automatically or manually. By
default it must be done manually, using :ref:`cluster_jmx` or
:ref:`cluster_command_line`.
|
||||
|
||||
It can also be performed programmatically with ``Cluster(system).down``.
|
||||
|
||||
You can enable automatic downing with configuration::

  akka.cluster.auto-down = on
|
||||
|
||||
Be aware that using auto-down implies that two separate clusters will
automatically be formed in case of a network partition. That might be
desirable for some applications but not for others.
|
||||
|
||||
Subscribe to Cluster Events
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
You can subscribe to change notifications of the cluster membership by using
|
||||
``Cluster(system).subscribe``. A snapshot of the full state,
|
||||
``akka.cluster.ClusterEvent.CurrentClusterState``, is sent to the subscriber
|
||||
as the first event, followed by events for incremental updates.
|
||||
|
||||
There are several types of change events; consult the API documentation
of classes that extend ``akka.cluster.ClusterEvent.ClusterDomainEvent``
for details about the events.
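As a sketch of what such a subscriber could look like (the event types shown
are only a selection; see the API documentation for the complete set)::

  import akka.actor.{ Actor, ActorLogging }
  import akka.cluster.Cluster
  import akka.cluster.ClusterEvent._

  class ClusterListener extends Actor with ActorLogging {
    override def preStart(): Unit =
      Cluster(context.system).subscribe(self, classOf[ClusterDomainEvent])

    override def postStop(): Unit =
      Cluster(context.system).unsubscribe(self)

    def receive = {
      case state: CurrentClusterState =>
        log.info("Current members: {}", state.members)
      case MemberUp(member) =>
        log.info("Member is Up: {}", member)
      case UnreachableMember(member) =>
        log.info("Member detected as unreachable: {}", member)
      case _: ClusterDomainEvent => // ignore other cluster events
    }
  }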
|
||||
|
||||
Worker Dial-in Example
|
||||
----------------------
|
||||
|
||||
Let's take a look at an example that illustrates how workers, here named *backend*,
can detect and register themselves with new master nodes, here named *frontend*.
|
||||
|
||||
The example application provides a service to transform text. When some text
is sent to one of the frontend services, it will be delegated to one of the
backend workers, which performs the transformation job and sends the result back to
the original client. New backend nodes, as well as new frontend nodes, can be
added to or removed from the cluster dynamically.
|
||||
|
||||
In this example the following imports are used:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#imports
|
||||
|
||||
Messages:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#messages
|
||||
|
||||
The backend worker that performs the transformation job:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#backend
|
||||
|
||||
Note that the ``TransformationBackend`` actor subscribes to cluster events to detect new,
potential frontend nodes, and sends them a registration message so that they know
that they can use the backend worker.
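A condensed sketch of that registration pattern is shown below; the names
follow the sample, but the details may differ from the actual sample code,
which is the authoritative version::

  import akka.actor.{ Actor, RootActorPath }
  import akka.cluster.{ Cluster, Member, MemberStatus }
  import akka.cluster.ClusterEvent._

  case object BackendRegistration

  class TransformationBackend extends Actor {
    val cluster = Cluster(context.system)

    override def preStart(): Unit = cluster.subscribe(self, classOf[MemberUp])
    override def postStop(): Unit = cluster.unsubscribe(self)

    def receive = {
      // handling of the actual transformation messages is omitted here
      case state: CurrentClusterState =>
        state.members.filter(_.status == MemberStatus.Up) foreach register
      case MemberUp(member) => register(member)
    }

    // tell the frontend actor on the new node that this backend is available
    def register(member: Member): Unit =
      context.actorFor(RootActorPath(member.address) / "user" / "frontend") !
        BackendRegistration
  }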
|
||||
|
||||
The frontend that receives user jobs and delegates to one of the registered backend workers:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#frontend
|
||||
|
||||
Note that the ``TransformationFrontend`` actor watches the registered backend
workers to be able to remove them from its list of available backend workers.
Death watch uses the cluster failure detector for nodes in the cluster, i.e. it detects
network failures and JVM crashes, in addition to graceful termination of the watched
actor.
|
||||
|
||||
This example is included in ``akka-samples/akka-sample-cluster``
and you can try it out by starting nodes in different terminal windows. For example, starting 2
frontend nodes and 3 backend nodes::
|
||||
|
||||
sbt
|
||||
|
||||
project akka-sample-cluster-experimental
|
||||
|
||||
run-main sample.cluster.transformation.TransformationFrontend 2551
|
||||
|
||||
run-main sample.cluster.transformation.TransformationBackend 2552
|
||||
|
||||
run-main sample.cluster.transformation.TransformationBackend
|
||||
|
||||
run-main sample.cluster.transformation.TransformationBackend
|
||||
|
||||
run-main sample.cluster.transformation.TransformationFrontend
|
||||
|
||||
|
||||
.. note:: The above example should probably be designed as two separate, frontend/backend, clusters, when there is a `cluster client for decoupling clusters <https://www.assembla.com/spaces/akka/tickets/1165>`_.
|
||||
|
||||
Cluster Aware Routers
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
All :ref:`routers <routing-scala>` can be made aware of member nodes in the cluster, i.e.
deploying new routees or looking up routees on nodes in the cluster.
When a node becomes unavailable or leaves the cluster the routees of that node are
automatically unregistered from the router. When new nodes join the cluster additional
routees are added to the router, according to the configuration.
|
||||
|
||||
When using a router with routees looked up on the cluster member nodes, i.e. the routees
|
||||
are already running, the configuration for a router looks like this:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/multi-jvm/scala/sample/cluster/stats/StatsSampleSpec.scala#router-lookup-config
|
||||
|
||||
It's the relative actor path defined in ``routees-path`` that identifies which actor to look up.
|
||||
|
||||
``nr-of-instances`` defines the total number of routees in the cluster, but there will not be
more than one per node. Setting ``nr-of-instances`` to a high value will result in new routees
being added to the router when nodes join the cluster.
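As a sketch of what such a configuration might look like (the router type,
paths and numbers below are only illustrative; the referenced sample
configuration is the authoritative version)::

  akka.actor.deployment {
    /statsService/workerRouter {
      router = consistent-hashing
      nr-of-instances = 100
      cluster {
        enabled = on
        routees-path = "/user/statsWorker"
        allow-local-routees = on
      }
    }
  }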
|
||||
|
||||
The same type of router could also have been defined in code:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#router-lookup-in-code
|
||||
|
||||
When using a router with routees created and deployed on the cluster member nodes
|
||||
the configuration for a router looks like this:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/multi-jvm/scala/sample/cluster/stats/StatsSampleSingleMasterSpec.scala#router-deploy-config
|
||||
|
||||
|
||||
``nr-of-instances`` defines the total number of routees in the cluster, but the number of routees
per node, ``max-nr-of-instances-per-node``, will not be exceeded. Setting ``nr-of-instances``
to a high value will result in creating and deploying additional routees when new nodes join
the cluster.
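Again as an illustrative sketch only (the deployment path and values are
assumptions; the referenced sample configuration is the authoritative
version), such a deploying router configuration might look like::

  akka.actor.deployment {
    /statsService/workerRouter {
      router = consistent-hashing
      nr-of-instances = 100
      cluster {
        enabled = on
        max-nr-of-instances-per-node = 3
        allow-local-routees = off
      }
    }
  }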
|
||||
|
||||
The same type of router could also have been defined in code:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#router-deploy-in-code
|
||||
|
||||
See :ref:`cluster_configuration` section for further descriptions of the settings.
|
||||
|
||||
|
||||
Router Example
|
||||
--------------
|
||||
|
||||
Let's take a look at how to use cluster aware routers.
|
||||
|
||||
The example application provides a service to calculate statistics for a text.
When some text is sent to the service it splits it into words, and delegates the task
of counting the number of characters in each word to a separate worker, a routee of a router.
The character count for each word is sent back to an aggregator that calculates
the average number of characters per word when all results have been collected.
|
||||
|
||||
In this example we use the following imports:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#imports
|
||||
|
||||
Messages:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#messages
|
||||
|
||||
The worker that counts the number of characters in each word:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#worker
|
||||
|
||||
The service that receives text from users and splits it up into words, delegates to workers and aggregates:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#service
|
||||
|
||||
|
||||
Note that there is nothing cluster specific so far, just plain actors.
|
||||
|
||||
We can use these actors with two different types of router setup: either with lookup of existing routees,
or with creation and deployment of routees. Remember, the routees are the workers in this case.
|
||||
|
||||
We start with the router setup with lookup of routees. All nodes start ``StatsService`` and
|
||||
``StatsWorker`` actors and the router is configured with ``routees-path``:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#start-router-lookup
|
||||
|
||||
This means that user requests can be sent to ``StatsService`` on any node and it will use
|
||||
``StatsWorker`` on all nodes. There can only be one worker per node, but that worker could easily
|
||||
fan out to local children if more parallelism is needed.
|
||||
|
||||
This example is included in ``akka-samples/akka-sample-cluster``
and you can try it out by starting nodes in different terminal windows. For example, starting 3
service nodes and 1 client::
|
||||
|
||||
run-main sample.cluster.stats.StatsSample 2551
|
||||
|
||||
run-main sample.cluster.stats.StatsSample 2552
|
||||
|
||||
run-main sample.cluster.stats.StatsSampleClient
|
||||
|
||||
run-main sample.cluster.stats.StatsSample
|
||||
|
||||
The above setup is nice for this example, but we will also take a look at how to use
|
||||
a single master node that creates and deploys workers. To keep track of a single
|
||||
master we need one additional actor:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#facade
|
||||
|
||||
The ``StatsFacade`` receives text from users and delegates to the current ``StatsService``, the single
master. It listens to cluster events to create or look up the ``StatsService`` depending on whether
it is on the same node or on another node. We run the master on the same node as the leader of
the cluster members, which is nothing more than the address currently sorted first in the member ring,
i.e. it can change when new nodes join or when the current leader leaves.
|
||||
|
||||
All nodes start ``StatsFacade`` and the router is now configured like this:
|
||||
|
||||
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#start-router-deploy
|
||||
|
||||
|
||||
This example is included in ``akka-samples/akka-sample-cluster``
and you can try it out by starting nodes in different terminal windows. For example, starting 3
service nodes and 1 client::
|
||||
|
||||
run-main sample.cluster.stats.StatsSampleOneMaster 2551
|
||||
|
||||
run-main sample.cluster.stats.StatsSampleOneMaster 2552
|
||||
|
||||
run-main sample.cluster.stats.StatsSampleOneMasterClient
|
||||
|
||||
run-main sample.cluster.stats.StatsSampleOneMaster
|
||||
|
||||
.. note:: The above example, especially the last part, will be simplified when the cluster handles automatic actor partitioning.
|
||||
|
||||
.. _cluster_jmx:
|
||||
|
||||
JMX
|
||||
^^^
|
||||
|
||||
Information and management of the cluster is available as JMX MBeans with the root name ``akka.Cluster``.
|
||||
The JMX information can be displayed with an ordinary JMX console such as JConsole or JVisualVM.
|
||||
|
||||
From JMX you can:
|
||||
|
||||
* see which members are part of the cluster
|
||||
* see status of this node
|
||||
* join this node to another node in cluster
|
||||
* mark any node in the cluster as down
|
||||
* tell any node in the cluster to leave
|
||||
|
||||
Member nodes are identified by their address, in the format `akka://actor-system-name@hostname:port`.
|
||||
|
||||
.. _cluster_command_line:
|
||||
|
||||
Command Line Management
|
||||
^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The cluster can be managed with the script `bin/akka-cluster` provided in the
|
||||
Akka distribution.
|
||||
|
||||
Run it without parameters to see instructions about how to use the script::
|
||||
|
||||
Usage: bin/akka-cluster <node-hostname:jmx-port> <command> ...
|
||||
|
||||
Supported commands are:
|
||||
join <node-url> - Sends a request to JOIN the node with the specified URL
|
||||
leave <node-url> - Sends a request for node with URL to LEAVE the cluster
|
||||
down <node-url> - Sends a request for marking node with URL as DOWN
|
||||
member-status - Asks the member node for its current status
|
||||
cluster-status - Asks the cluster for its current status (member ring,
|
||||
unavailable nodes, meta data etc.)
|
||||
leader - Asks the cluster who the current leader is
|
||||
is-singleton - Checks if the cluster is a singleton cluster (single
|
||||
node cluster)
|
||||
is-available - Checks if the member node is available
|
||||
is-running - Checks if the member node is running
|
||||
has-convergence - Checks if there is a cluster convergence
|
||||
Where the <node-url> should be in the format 'akka://actor-system-name@hostname:port'
|
||||
|
||||
Examples: bin/akka-cluster localhost:9999 is-available
|
||||
bin/akka-cluster localhost:9999 join akka://MySystem@darkstar:2552
|
||||
bin/akka-cluster localhost:9999 cluster-status
|
||||
|
||||
|
||||
To be able to use the script you must enable remote monitoring and management when starting the JVMs of the cluster nodes,
as described in `Monitoring and Management Using JMX Technology <http://docs.oracle.com/javase/6/docs/technotes/guides/management/agent.html>`_.
|
||||
|
||||
Example of system properties to enable remote monitoring and management::
|
||||
|
||||
java -Dcom.sun.management.jmxremote.port=9999 \
|
||||
-Dcom.sun.management.jmxremote.authenticate=false \
|
||||
-Dcom.sun.management.jmxremote.ssl=false
|
||||
|
||||
.. _cluster_configuration:
|
||||
|
||||
Configuration
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
There are several configuration properties for the cluster. We refer to the following
|
||||
reference file for more information:
|
||||
|
||||
|
||||
.. literalinclude:: ../../../akka-cluster/src/main/resources/reference.conf
|
||||
:language: none
|
||||
|
||||
Cluster Scheduler
|
||||
-----------------
|
||||
|
||||
It is recommended that you change the ``tick-duration`` of the default scheduler
to 33 ms or less when using the cluster, if you don't need to have it
configured to a longer duration for other reasons. If you don't do this
a dedicated scheduler will be used for the periodic tasks of the cluster, which
introduces the extra overhead of another thread.
|
||||
|
||||
::

  # shorter tick-duration of default scheduler when using cluster
  akka.scheduler.tick-duration = 33ms
|
||||
|
||||
|
||||
|
||||
akka-docs/rst/cluster/cluster.rst (new file, 644 lines added)
@@ -0,0 +1,644 @@
.. _cluster:
|
||||
|
||||
######################
|
||||
Cluster Specification
|
||||
######################
|
||||
|
||||
.. note:: This module is :ref:`experimental <experimental>`. This document describes the design concepts of the new clustering coming in Akka Coltrane. Not everything described here is implemented yet.
|
||||
|
||||
Intro
|
||||
=====
|
||||
|
||||
Akka Cluster provides a fault-tolerant, elastic, decentralized peer-to-peer
cluster with no single point of failure (SPOF) or single point of bottleneck
(SPOB). It implements a Dynamo-style system using gossip protocols, automatic
failure detection, automatic partitioning, handoff, and cluster rebalancing, but
with some differences due to the fact that it is not just managing passive data,
but actors - active, sometimes stateful, components that also have requirements
on message ordering, the number of active instances in the cluster, etc.
|
||||
|
||||
|
||||
Terms
|
||||
=====
|
||||
|
||||
These terms are used throughout the documentation.
|
||||
|
||||
**node**
|
||||
A logical member of a cluster. There could be multiple nodes on a physical
|
||||
machine. Defined by a `hostname:port` tuple.
|
||||
|
||||
**cluster**
|
||||
A set of nodes. Contains distributed Akka applications.
|
||||
|
||||
**partition**
|
||||
An actor or subtree of actors in the Akka application that is distributed
|
||||
within the cluster.
|
||||
|
||||
**partition point**
|
||||
The actor at the head of a partition. The point around which a partition is
|
||||
formed.
|
||||
|
||||
**partition path**
|
||||
Also referred to as the actor address. Has the format `actor1/actor2/actor3`
|
||||
|
||||
**instance count**
|
||||
The number of instances of a partition in the cluster. Also referred to as the
|
||||
``N-value`` of the partition.
|
||||
|
||||
**instance node**
|
||||
A node that an actor instance is assigned to.
|
||||
|
||||
**partition table**
|
||||
A mapping from partition path to a set of instance nodes (where the nodes are
|
||||
referred to by the ordinal position given the nodes in sorted order).
|
||||
|
||||
**leader**
|
||||
A single node in the cluster that acts as the leader, managing cluster convergence,
partitions, fail-over, rebalancing, etc.
|
||||
|
||||
|
||||
Membership
|
||||
==========
|
||||
|
||||
A cluster is made up of a set of member nodes. The identifier for each node is a
|
||||
``hostname:port`` pair. An Akka application is distributed over a cluster with
|
||||
each node hosting some part of the application. Cluster membership and
|
||||
partitioning of the application are decoupled. A node could be a member of a
|
||||
cluster without hosting any actors.
|
||||
|
||||
|
||||
Singleton Cluster
|
||||
-----------------
|
||||
|
||||
If a node does not have a preconfigured contact point to join in the Akka
|
||||
configuration, then it is considered a singleton cluster (single node cluster)
|
||||
and will automatically transition from ``joining`` to ``up``. Singleton clusters
|
||||
can later explicitly send a ``Join`` message to another node to form an N-node
|
||||
cluster. It is also possible to link multiple N-node clusters by ``joining`` them.
|
||||
|
||||
|
||||
Gossip
|
||||
------
|
||||
|
||||
The cluster membership used in Akka is based on Amazon's `Dynamo`_ system and
particularly the approach taken in Basho's `Riak`_ distributed database.
Cluster membership is communicated using a `Gossip Protocol`_, where the current
state of the cluster is gossiped randomly through the cluster. Joining a cluster
is initiated by issuing a ``Join`` command to one of the nodes in the cluster to
be joined.
|
||||
|
||||
.. _Gossip Protocol: http://en.wikipedia.org/wiki/Gossip_protocol
|
||||
.. _Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
|
||||
.. _Riak: http://basho.com/technology/architecture/
|
||||
|
||||
|
||||
Vector Clocks
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
`Vector clocks`_ are an algorithm for generating a partial ordering of events in
|
||||
a distributed system and detecting causality violations.
|
||||
|
||||
We use vector clocks to reconcile and merge differences in cluster state
|
||||
during gossiping. A vector clock is a set of (node, counter) pairs. Each update
|
||||
to the cluster state has an accompanying update to the vector clock.
|
||||
|
||||
One problem with vector clocks is that their history can grow very long over time,
which both makes comparisons take longer and takes up unnecessary
memory. To solve that problem we prune the vector clocks according to
the `pruning algorithm`_ in Riak.
|
||||
|
||||
.. _Vector Clocks: http://en.wikipedia.org/wiki/Vector_clock
|
||||
.. _pruning algorithm: http://wiki.basho.com/Vector-Clocks.html#Vector-Clock-Pruning
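To illustrate the idea (this is not Akka's internal implementation), a vector
clock can be modelled as a map from node to counter; comparison tells whether
one clock causally precedes another, and merging keeps the maximum counter per
node::

  case class VectorClockSketch(versions: Map[String, Long] = Map.empty) {
    // increment this node's counter when it updates the cluster state
    def tick(node: String): VectorClockSketch =
      VectorClockSketch(versions + (node -> (versions.getOrElse(node, 0L) + 1)))

    // true if this clock causally precedes (or equals) the other
    def <=(other: VectorClockSketch): Boolean =
      versions.forall { case (node, counter) => counter <= other.versions.getOrElse(node, 0L) }

    // reconcile divergent states by keeping the max counter per node
    def merge(other: VectorClockSketch): VectorClockSketch =
      VectorClockSketch((versions.keySet ++ other.versions.keySet).map { node =>
        node -> math.max(versions.getOrElse(node, 0L), other.versions.getOrElse(node, 0L))
      }.toMap)
  }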
|
||||
|
||||
|
||||
Gossip Convergence
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Information about the cluster converges at certain points of time. This is when
|
||||
all nodes have seen the same cluster state. Convergence is recognised by passing
|
||||
a map from node to current state version during gossip. This information is
|
||||
referred to as the gossip overview. When all versions in the overview are equal
|
||||
there is convergence. Gossip convergence cannot occur while any nodes are
unreachable; either the nodes must become reachable again, or they must be
moved into the ``down`` or ``removed`` states (see the section on `Member states`_
below).
|
||||
|
||||
|
||||
Failure Detector
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
The failure detector is responsible for trying to detect if a node is
|
||||
unreachable from the rest of the cluster. For this we are using an
|
||||
implementation of `The Phi Accrual Failure Detector`_ by Hayashibara et al.
|
||||
|
||||
An accrual failure detector decouples monitoring from interpretation. That makes
it applicable to a wider range of scenarios and better suited to building generic
failure detection services. The idea is that it keeps a history of failure
statistics, calculated from heartbeats received from other nodes, and
tries to make educated guesses by taking multiple factors, and how they
accumulate over time, into account in order to come up with a better guess whether a
specific node is up or down. Rather than just answering "yes" or "no" to the
question "is the node down?" it returns a ``phi`` value representing the
likelihood that the node is down.
|
||||
|
||||
The ``threshold`` that is the basis for the calculation is configurable by the
|
||||
user. A low ``threshold`` is prone to generate many wrong suspicions but ensures
|
||||
a quick detection in the event of a real crash. Conversely, a high ``threshold``
|
||||
generates fewer mistakes but needs more time to detect actual crashes. The
|
||||
default ``threshold`` is 8 and is appropriate for most situations. However in
|
||||
cloud environments, such as Amazon EC2, the value could be increased to 12 in
|
||||
order to account for network issues that sometimes occur on such platforms.
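As a rough sketch of how such a ``phi`` value can be computed from the
observed heartbeat history (illustrative only, assuming normally distributed
heartbeat inter-arrival times; not the exact implementation)::

  // phi = -log10(1 - CDF(timeSinceLastHeartbeat)), where CDF is the cumulative
  // distribution function of the inter-arrival times observed so far
  def phi(timeSinceLastHeartbeat: Double, mean: Double, stdDeviation: Double): Double = {
    val y = (timeSinceLastHeartbeat - mean) / stdDeviation
    // logistic approximation of the normal CDF
    val cdf = 1.0 / (1.0 + math.exp(-y * (1.5976 + 0.070566 * y * y)))
    -math.log10(1.0 - cdf)
  }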
|
||||
|
||||
.. _The Phi Accrual Failure Detector: http://ddg.jaist.ac.jp/pub/HDY+04.pdf
|
||||
|
||||
|
||||
Leader
|
||||
^^^^^^
|
||||
|
||||
After gossip convergence a ``leader`` for the cluster can be determined. There is no
|
||||
``leader`` election process, the ``leader`` can always be recognised deterministically
|
||||
by any node whenever there is gossip convergence. The ``leader`` is simply the first
|
||||
node in sorted order that is able to take the leadership role, where the only
|
||||
allowed member states for a ``leader`` are ``up``, ``leaving`` or ``exiting`` (see
|
||||
below for more information about member states).
|
||||
|
||||
The role of the ``leader`` is to shift members in and out of the cluster, changing
|
||||
``joining`` members to the ``up`` state or ``exiting`` members to the
|
||||
``removed`` state, and to schedule rebalancing across the cluster. Currently
|
||||
``leader`` actions are only triggered by receiving a new cluster state with gossip
|
||||
convergence but it may also be possible for the user to explicitly rebalance the
|
||||
cluster by specifying migrations, or to rebalance the cluster automatically
|
||||
based on metrics from member nodes. Metrics may be spread using the gossip
|
||||
protocol or possibly more efficiently using a *random chord* method, where the
|
||||
``leader`` contacts several random nodes around the cluster ring and each contacted
|
||||
node gathers information from their immediate neighbours, giving a random
|
||||
sampling of load information.
|
||||
|
||||
The ``leader`` also has the power, if configured so, to "auto-down" a node that
|
||||
according to the Failure Detector is considered unreachable. This means setting
|
||||
the unreachable node status to ``down`` automatically.
|
||||
|
||||
|
||||
Seed Nodes
|
||||
^^^^^^^^^^
|
||||
|
||||
The seed nodes are configured contact points for the initial join of the cluster.
When a new node is started it sends a message to all seed nodes and
then sends a join command to the one that answers first.
|
||||
|
||||
It is possible to turn off automatic join.
|
||||
|
||||
|
||||
Gossip Protocol
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
A variation of *push-pull gossip* is used to reduce the amount of gossip
|
||||
information sent around the cluster. In push-pull gossip a digest is sent
|
||||
representing current versions but not actual values; the recipient of the gossip
|
||||
can then send back any values for which it has newer versions and also request
|
||||
values for which it has outdated versions. Akka uses a single shared state with
|
||||
a vector clock for versioning, so the variant of push-pull gossip used in Akka
|
||||
makes use of the gossip overview (containing the current state versions for all
|
||||
nodes) to only push the actual state as needed. This also allows any node to
|
||||
easily determine which other nodes have newer or older information, not just the
|
||||
nodes involved in a gossip exchange.
|
||||
|
||||
Periodically, the default is every 1 second, each node chooses another random
|
||||
node to initiate a round of gossip with. The choice of node is random but can
|
||||
also include extra gossiping nodes with either newer or older state versions.
|
||||
|
||||
The gossip overview contains the current state version for all nodes and also a
|
||||
list of unreachable nodes. Whenever a node receives a gossip overview it updates
|
||||
the `Failure Detector`_ with the liveness information.
|
||||
|
||||
The nodes defined as ``seed`` nodes are just regular member nodes whose only
|
||||
"special role" is to function as contact points in the cluster.
|
||||
|
||||
During each round of gossip exchange, a node gossips, with some probability, to a
random node with newer or older state information, if any, based on the current
gossip overview. Otherwise it gossips to any random live node.
|
||||
|
||||
The gossiper only sends the gossip overview to the chosen node. The recipient of
|
||||
the gossip can use the gossip overview to determine whether:
|
||||
|
||||
1. it has a newer version of the gossip state, in which case it sends that back
|
||||
to the gossiper, or
|
||||
|
||||
2. it has an outdated version of the state, in which case the recipient requests
|
||||
the current state from the gossiper
|
||||
|
||||
If the recipient and the gossiper have the same version then the gossip state is
not sent or requested.
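Expressed as a sketch (illustrative only, using a plain map as a simplified
version vector), the recipient's decision looks like this::

  type Version = Map[String, Long] // node -> state version counter

  // true if `a` has seen at least one update that `b` has not
  def hasNewsFor(a: Version, b: Version): Boolean =
    a.exists { case (node, counter) => counter > b.getOrElse(node, 0L) }

  def onGossipOverview(local: Version, remote: Version): String =
    if (hasNewsFor(local, remote)) "send our newer state back to the gossiper"
    else if (hasNewsFor(remote, local)) "request the current state from the gossiper"
    else "versions are equal: nothing is sent or requested"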
|
||||
|
||||
The main structures used in gossiping are the gossip overview and the gossip
|
||||
state::
|
||||
|
||||
GossipOverview {
|
||||
versions: Map[Node, VectorClock],
|
||||
unreachable: Set[Node]
|
||||
}
|
||||
|
||||
GossipState {
|
||||
version: VectorClock,
|
||||
members: SortedSet[Member],
|
||||
partitions: Tree[PartitionPath, Node],
|
||||
pending: Set[PartitionChange],
|
||||
meta: Option[Map[String, Array[Byte]]]
|
||||
}
|
||||
|
||||
Some of the other structures used are::
|
||||
|
||||
Node = InetSocketAddress
|
||||
|
||||
Member {
|
||||
node: Node,
|
||||
state: MemberState
|
||||
}
|
||||
|
||||
MemberState = Joining | Up | Leaving | Exiting | Down | Removed
|
||||
|
||||
PartitionChange {
|
||||
from: Node,
|
||||
to: Node,
|
||||
path: PartitionPath,
|
||||
status: PartitionChangeStatus
|
||||
}
|
||||
|
||||
PartitionChangeStatus = Awaiting | Complete
|
||||
|
||||
|
||||
Membership Lifecycle
|
||||
--------------------
|
||||
|
||||
A node begins in the ``joining`` state. Once all nodes have seen that the new
|
||||
node is joining (through gossip convergence) the ``leader`` will set the member
|
||||
state to ``up`` and can start assigning partitions to the new node.
|
||||
|
||||
If a node is leaving the cluster in a safe, expected manner then it switches to
|
||||
the ``leaving`` state. The ``leader`` will reassign partitions across the cluster
|
||||
(it is possible for a leaving node to itself be the ``leader``). When all partition
|
||||
handoff has completed then the node will change to the ``exiting`` state. Once
|
||||
all nodes have seen the exiting state (convergence) the ``leader`` will remove the
|
||||
node from the cluster, marking it as ``removed``.
|
||||
|
||||
If a node is unreachable then gossip convergence is not possible and therefore
|
||||
any ``leader`` actions are also not possible (for instance, allowing a node to
|
||||
become a part of the cluster, or changing actor distribution). To be able to
|
||||
move forward the state of the unreachable nodes must be changed. If the
|
||||
unreachable node is experiencing only transient difficulties then it can be
|
||||
explicitly marked as ``down`` using the ``down`` user action. When this node
|
||||
comes back up and begins gossiping it will automatically go through the joining
|
||||
process again. If the unreachable node will be permanently down then it can be
|
||||
removed from the cluster directly by shutting the actor system down or killing it
|
||||
through an external ``SIGKILL`` signal, invocation of ``System.exit(status)`` or
|
||||
similar. The cluster can, through the leader, also *auto-down* a node.
|
||||
|
||||
This means that nodes can join and leave the cluster at any point in time, i.e.
|
||||
provide cluster elasticity.
|
||||
|
||||
|
||||
State Diagram for the Member States
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. image:: images/member-states.png
|
||||
|
||||
|
||||
Member States
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
- **joining**
|
||||
transient state when joining a cluster
|
||||
|
||||
- **up**
|
||||
normal operating state
|
||||
|
||||
- **leaving** / **exiting**
|
||||
states during graceful removal
|
||||
|
||||
- **down**
|
||||
marked as down/offline/unreachable
|
||||
|
||||
- **removed**
|
||||
tombstone state (no longer a member)
|
||||
|
||||
|
||||
User Actions
|
||||
^^^^^^^^^^^^
|
||||
|
||||
- **join**
|
||||
join a single node to a cluster - can be explicit or automatic on
startup if a node to join has been specified in the configuration
|
||||
|
||||
- **leave**
|
||||
tell a node to leave the cluster gracefully
|
||||
|
||||
- **down**
|
||||
mark a node as temporarily down
|
||||
|
||||
|
||||
Leader Actions
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
The ``leader`` has the following duties:
|
||||
|
||||
- shifting members in and out of the cluster
|
||||
|
||||
- joining -> up
|
||||
|
||||
- exiting -> removed
|
||||
|
||||
- partition distribution
|
||||
|
||||
- scheduling handoffs (pending changes)
|
||||
|
||||
- setting the partition table (partition path -> base node)
|
||||
|
||||
- Automatic rebalancing based on runtime metrics in the system (such as CPU,
|
||||
RAM, Garbage Collection, mailbox depth etc.)
|
||||
|
||||
|
||||
Partitioning
|
||||
============
|
||||
|
||||
Each partition (an actor or actor subtree) in the actor system is assigned to a
|
||||
set of nodes in the cluster. The actor at the head of the partition is referred
|
||||
to as the partition point. The mapping from partition path (actor address of the
|
||||
format "a/b/c") to instance nodes is stored in the partition table and is
|
||||
maintained as part of the cluster state through the gossip protocol. The
|
||||
partition table is only updated by the ``leader`` node. Currently the only possible
|
||||
partition points are *routed* actors.
|
||||
|
||||
Routed actors can have an instance count greater than one. The instance count is
|
||||
also referred to as the ``N-value``. If the ``N-value`` is greater than one then
|
||||
a set of instance nodes will be given in the partition table.
|
||||
|
||||
Note that in the first implementation there may be a restriction such that only
top-level partitions are possible (the highest possible partition points are
used and sub-partitioning is not allowed). This is still to be explored in more detail.
|
||||
|
||||
The cluster ``leader`` determines the current instance count for a partition based
|
||||
on two axes: fault-tolerance and scaling.
|
||||
|
||||
Fault-tolerance determines a minimum number of instances for a routed actor
|
||||
(allowing N-1 nodes to crash while still maintaining at least one running actor
|
||||
instance). The user can specify a function from current number of nodes to the
|
||||
number of acceptable node failures: n: Int => f: Int where f < n.
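For example, such a function could be written as (illustrative only)::

  // tolerate up to half of the current nodes failing
  val acceptableFailures: Int => Int = numberOfNodes => numberOfNodes / 2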
|
||||
|
||||
Scaling reflects the number of instances needed to maintain good throughput and
|
||||
is influenced by metrics from the system, particularly a history of mailbox
|
||||
size, CPU load, and GC percentages. It may also be possible to accept scaling
|
||||
hints from the user that indicate expected load.
|
||||
|
||||
The balancing of partitions can be determined in a very simple way in the first
|
||||
implementation, where the overlap of partitions is minimized. Partitions are
|
||||
spread over the cluster ring in a circular fashion, with each instance node in
|
||||
the first available space. For example, given a cluster with ten nodes and three
|
||||
partitions, A, B, and C, having N-values of 4, 3, and 5; partition A would have
|
||||
instances on nodes 1-4; partition B would have instances on nodes 5-7; partition
|
||||
C would have instances on nodes 8-10 and 1-2. The only overlap is on nodes 1 and
|
||||
2.
|
||||
|
||||
The distribution of partitions is not limited, however, to having instances on
|
||||
adjacent nodes in the sorted ring order. Each instance can be assigned to any
|
||||
node and the more advanced load balancing algorithms will make use of this. The
|
||||
partition table contains a mapping from path to instance nodes. The partitioning
|
||||
for the above example would be::
|
||||
|
||||
A -> { 1, 2, 3, 4 }
|
||||
B -> { 5, 6, 7 }
|
||||
C -> { 8, 9, 10, 1, 2 }
|
||||
|
||||
If 5 new nodes join the cluster and in sorted order these nodes appear after the
|
||||
current nodes 2, 4, 5, 7, and 8, then the partition table could be updated to
|
||||
the following, with all instances on the same physical nodes as before::
|
||||
|
||||
A -> { 1, 2, 4, 5 }
|
||||
B -> { 7, 9, 10 }
|
||||
C -> { 12, 14, 15, 1, 2 }
|
||||
|
||||
When rebalancing is required the ``leader`` will schedule handoffs, gossiping a set
|
||||
of pending changes, and when each change is complete the ``leader`` will update the
|
||||
partition table.
|
||||
|
||||
|
||||
Handoff
|
||||
-------
|
||||
|
||||
Handoff for an actor-based system is different from that for a data-based system. The
|
||||
most important point is that message ordering (from a given node to a given
|
||||
actor instance) may need to be maintained. If an actor is a singleton actor
|
||||
(only one instance possible throughout the cluster) then the cluster may also
|
||||
need to assure that there is only one such actor active at any one time. Both of
|
||||
these situations can be handled by forwarding and buffering messages during
|
||||
transitions.
|
||||
|
||||
A *graceful handoff* (one where the previous host node is up and running during
|
||||
the handoff), given a previous host node ``N1``, a new host node ``N2``, and an
|
||||
actor partition ``A`` to be migrated from ``N1`` to ``N2``, has this general
|
||||
structure:
|
||||
|
||||
1. the ``leader`` sets a pending change for ``N1`` to handoff ``A`` to ``N2``
|
||||
|
||||
2. ``N1`` notices the pending change and sends an initialization message to ``N2``
|
||||
|
||||
3. in response ``N2`` creates ``A`` and sends back a ready message
|
||||
|
||||
4. after receiving the ready message ``N1`` marks the change as
|
||||
complete and shuts down ``A``
|
||||
|
||||
5. the ``leader`` sees the migration is complete and updates the partition table
|
||||
|
||||
6. all nodes eventually see the new partitioning and use ``N2``
|
||||
|
||||
|
||||
Transitions
|
||||
^^^^^^^^^^^
|
||||
|
||||
There are transition times in the handoff process where different approaches can
|
||||
be used to give different guarantees.
|
||||
|
||||
|
||||
Migration Transition
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The first transition starts when ``N1`` initiates the moving of ``A`` and ends
|
||||
when ``N1`` receives the ready message, and is referred to as the *migration
|
||||
transition*.
|
||||
|
||||
The first question is: during the migration transition, should:
|
||||
|
||||
- ``N1`` continue to process messages for ``A``?
|
||||
|
||||
- Or is it important that no messages for ``A`` are processed on
|
||||
``N1`` once migration begins?
|
||||
|
||||
If it is okay for the previous host node ``N1`` to process messages during
|
||||
migration then there is nothing that needs to be done at this point.
|
||||
|
||||
If no messages are to be processed on the previous host node during migration
|
||||
then there are two possibilities: the messages are forwarded to the new host and
|
||||
buffered until the actor is ready, or the messages are simply dropped by
|
||||
terminating the actor and allowing the normal dead letter process to be used.
|
||||
|
||||
|
||||
Update Transition
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
The second transition begins when the migration is marked as complete and ends
|
||||
when all nodes have the updated partition table (when all nodes will use ``N2``
|
||||
as the host for ``A``, i.e. we have convergence) and is referred to as the
|
||||
*update transition*.
|
||||
|
||||
Once the update transition begins ``N1`` can forward any messages it receives
|
||||
for ``A`` to the new host ``N2``. The question is whether or not message
|
||||
ordering needs to be preserved. If messages sent to the previous host node
|
||||
``N1`` are being forwarded, then it is possible that a message sent to ``N1``
|
||||
could be forwarded after a direct message to the new host ``N2``, breaking
|
||||
message ordering from a client to actor ``A``.
|
||||
|
||||
In this situation ``N2`` can keep a buffer for messages per sending node. Each
|
||||
buffer is flushed and removed when an acknowledgement (``ack``) message has been
|
||||
received. When each node in the cluster sees the partition update it first sends
|
||||
an ``ack`` message to the previous host node ``N1`` before beginning to use
|
||||
``N2`` as the new host for ``A``. Any messages sent from the client node
|
||||
directly to ``N2`` will be buffered. ``N1`` can count down the number of acks to
|
||||
determine when no more forwarding is needed. The ``ack`` message from any node
|
||||
will always follow any other messages sent to ``N1``. When ``N1`` receives the
|
||||
``ack`` message it also forwards it to ``N2`` and again this ``ack`` message
|
||||
will follow any other messages already forwarded for ``A``. When ``N2`` receives
|
||||
an ``ack`` message, the buffer for the sending node can be flushed and removed.
|
||||
Any subsequent messages from this sending node can be queued normally. Once all
|
||||
nodes in the cluster have acknowledged the partition change and ``N2`` has
|
||||
cleared all buffers, the handoff is complete and message ordering has been
|
||||
preserved. In practice the buffers should remain small as it is only those
|
||||
messages sent directly to ``N2`` before the acknowledgement has been forwarded
|
||||
that will be buffered.
|
||||
|
||||
|
||||
Graceful Handoff
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
A more complete process for graceful handoff would be:
|
||||
|
||||
1. the ``leader`` sets a pending change for ``N1`` to handoff ``A`` to ``N2``
|
||||
|
||||
|
||||
2. ``N1`` notices the pending change and sends an initialization message to
|
||||
``N2``. Options:
|
||||
|
||||
a. keep ``A`` on ``N1`` active and continuing processing messages as normal
|
||||
|
||||
b. ``N1`` forwards all messages for ``A`` to ``N2``
|
||||
|
||||
c. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
|
||||
becoming dead letters)
|
||||
|
||||
|
||||
3. in response ``N2`` creates ``A`` and sends back a ready message. Options:
|
||||
|
||||
a. ``N2`` simply processes messages for ``A`` as normal
|
||||
|
||||
b. ``N2`` creates a buffer per sending node for ``A``. Each buffer is
|
||||
opened (flushed and removed) when an acknowledgement for the sending
|
||||
node has been received (via ``N1``)
|
||||
|
||||
|
||||
4. after receiving the ready message ``N1`` marks the change as complete. Options:
|
||||
|
||||
a. ``N1`` forwards all messages for ``A`` to ``N2`` during the update transition
|
||||
|
||||
b. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
|
||||
becoming dead letters)
|
||||
|
||||
|
||||
5. the ``leader`` sees the migration is complete and updates the partition table
|
||||
|
||||
|
||||
6. all nodes eventually see the new partitioning and use ``N2``
|
||||
|
||||
i. each node sends an acknowledgement message to ``N1``
|
||||
|
||||
ii. when ``N1`` receives the acknowledgement it can count down the pending
|
||||
acknowledgements and remove forwarding when complete
|
||||
|
||||
iii. when ``N2`` receives the acknowledgement it can open the buffer for the
|
||||
sending node (if buffers are used)
|
||||
|
||||
|
||||
The default approach is to take options 2a, 3a, and 4a - allowing ``A`` on
|
||||
``N1`` to continue processing messages during migration and then forwarding any
|
||||
messages during the update transition. This assumes stateless actors that do not
|
||||
have a dependency on message ordering from any given source.
|
||||
|
||||
- If an actor has a distributed durable mailbox then nothing needs to be done,
|
||||
other than migrating the actor.
|
||||
|
||||
- If message ordering needs to be maintained during the update transition then
|
||||
option 3b can be used, creating buffers per sending node.
|
||||
|
||||
- If the actors are robust to message send failures then the dropping messages
|
||||
approach can be used (with no forwarding or buffering needed).
|
||||
|
||||
- If an actor is a singleton (only one instance possible throughout the cluster)
|
||||
and state is transferred during the migration initialization, then options 2b
|
||||
and 3b would be required.
|
||||
|
||||
|
||||
Stateful Actor Replication
|
||||
==========================
|
||||
|
||||
Support for stateful singleton actors will come in future releases of Akka, and
is scheduled for Akka 2.2. Since we already have a Dynamo base for the clustering
we should use the same infrastructure to provide stateful actor clustering and a
datastore as well. The stateful actor clustering should be layered on top of the
distributed datastore. See the next section for a rough outline of how the
distributed datastore could be implemented.
|
||||
|
||||
|
||||
Implementing a Dynamo-style Distributed Database on top of Akka Cluster
|
||||
-----------------------------------------------------------------------
|
||||
|
||||
The missing pieces to implement a full Dynamo-style eventually consistent data
|
||||
storage on top of the Akka Cluster as described in this document are:
|
||||
|
||||
- Configuration of ``READ`` and ``WRITE`` consistency levels according to the
|
||||
``N/R/W`` numbers defined in the Dynamo paper.
|
||||
|
||||
- R = read replica count
|
||||
|
||||
- W = write replica count
|
||||
|
||||
- N = replication factor
|
||||
|
||||
- Q = QUORUM = N / 2 + 1
|
||||
|
||||
- W + R > N = full consistency (a worked example is given at the end of this section)
|
||||
|
||||
- Define a versioned data message wrapper::
|
||||
|
||||
Versioned[T](hash: Long, version: VectorClock, data: T)
|
||||
|
||||
- Define a single system data broker actor on each node that uses a ``Consistent
Hashing Router`` and that has instances on all the other nodes in the node ring.
|
||||
|
||||
- For ``WRITE``:
|
||||
|
||||
1. Wrap data in a ``Versioned Message``
|
||||
|
||||
2. Send the ``Versioned Message`` with the data to a number of nodes
matching the ``W-value``.
|
||||
|
||||
- For ``READ``:
|
||||
|
||||
1. Read in the ``Versioned Message`` with the data from as many replicas as
|
||||
you need for the consistency level required by the ``R-value``.
|
||||
|
||||
2. Do comparison on the versions (using `Vector Clocks`_)
|
||||
|
||||
3. If the versions differ then do `Read Repair`_ to update the inconsistent
|
||||
nodes.
|
||||
|
||||
4. Return the latest versioned data.
|
||||
|
||||
.. _Read Repair: http://wiki.apache.org/cassandra/ReadRepair
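As a worked example of the ``N/R/W`` arithmetic above: with a replication
factor of N = 3 the quorum is Q = 3 / 2 + 1 = 2, and choosing W = 2 and R = 2
gives W + R = 4 > 3 = N, i.e. full consistency as defined above.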
akka-docs/rst/cluster/images/member-states.png (new binary file, 38 KiB, not shown)
akka-docs/rst/cluster/images/more.png (new binary file, 1.5 KiB, not shown)
akka-docs/rst/cluster/index.rst (new file, 8 lines added)
@@ -0,0 +1,8 @@
Cluster
=======

.. toctree::
   :maxdepth: 2

   cluster
   cluster-usage