add preprocessor for RST docs, see #2461 and #2431

The idea is to filter the sources, replacing @<var>@ occurrences with
the mapping for <var> (which is currently hard-coded); @@ maps to a
literal @. In order to make this work, I had to move the doc sources one
directory down (into akka-docs/rst), so that the filtered result can live
in a sibling directory and relative links (to _sphinx plugins or real
code) continue to work.

While I was at it I also changed it so that WARNINGs and ERRORs are no
longer swallowed into the debug dump but are printed at [warn] level
(at minimum).

One piece of fallout is that the (online) html build is now run after
the normal one, not in parallel.
Roland 2012-09-21 10:47:58 +02:00
parent c0f60da8cc
commit 9bc01ae265
266 changed files with 270 additions and 182 deletions


@@ -0,0 +1,430 @@
.. _cluster_usage:
###############
Cluster Usage
###############
.. note:: This module is :ref:`experimental <experimental>`. This document describes how to use the features implemented so far. More features are coming in Akka Coltrane. Track progress of the Coltrane milestone in `Assembla <http://www.assembla.com/spaces/akka/tickets>`_ and the `Roadmap <https://docs.google.com/document/d/18W9-fKs55wiFNjXL9q50PYOnR7-nnsImzJqHOPPbM4E/edit?hl=en_US>`_.
For an introduction to the Akka Cluster concepts please see :ref:`cluster`.
Preparing Your Project for Clustering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The Akka cluster is a separate jar file. Make sure that you have the following dependency in your project:
.. parsed-literal::
"com.typesafe.akka" %% "akka-cluster" % "@version@" @crossString@
If you are using the latest nightly build you should pick a timestamped Akka
version from
`<http://repo.typesafe.com/typesafe/snapshots/com/typesafe/akka/akka-cluster_@binVersion@/>`_.
We recommend against using ``SNAPSHOT``, so that your builds are stable.
A Simple Cluster Example
^^^^^^^^^^^^^^^^^^^^^^^^
The following small program together with its configuration starts an ``ActorSystem``
with the Cluster extension enabled. It joins the cluster and logs some membership events.
Try it out:
1. Add the following ``application.conf`` to your project and place it in ``src/main/resources`` (a minimal sketch of what such a configuration can look like is shown after these steps):
.. literalinclude:: ../../../akka-samples/akka-sample-cluster/src/main/resources/application.conf
   :language: none
To enable cluster capabilities in your Akka project you should, at a minimum, add the :ref:`remoting-scala`
settings, but with ``akka.cluster.ClusterActorRefProvider`` as the actor reference provider.
The ``akka.cluster.seed-nodes`` list and the other cluster settings should normally also be added to your
``application.conf`` file.
The seed nodes are configured contact points for the initial, automatic join of the cluster.
Note that if you are going to start the nodes on different machines you need to specify the
IP addresses or host names of the machines in ``application.conf`` instead of ``127.0.0.1``.
2. Add the following main program to your project and place it in ``src/main/scala``:
.. literalinclude:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/simple/SimpleClusterApp.scala
   :language: scala
3. Start the first seed node. Open a sbt session in one terminal window and run::
     run-main sample.cluster.simple.SimpleClusterApp 2551
2551 corresponds to the port of the first seed-nodes element in the configuration.
In the log output you see that the cluster node has been started and changed status to 'Up'.
4. Start the second seed node. Open a sbt session in another terminal window and run::
     run-main sample.cluster.simple.SimpleClusterApp 2552
2552 corresponds to the port of the second seed-nodes element in the configuration.
In the log output you see that the cluster node has been started and joins the other seed node,
becoming a member of the cluster. Its status changed to 'Up'.
Switch over to the first terminal window and see in the log output that the member joined.
5. Start another node. Open a sbt session in yet another terminal window and run::
     run-main sample.cluster.simple.SimpleClusterApp
Now you don't need to specify the port number, and it will use a random available port.
It joins one of the configured seed nodes. Look at the log output in the different terminal
windows.
Start even more nodes in the same way, if you like.
6. Shut down one of the nodes by pressing 'ctrl-c' in one of the terminal windows.
The other nodes will detect the failure after a while, which you can see in the log
output in the other terminals.
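As promised in step 1, here is a minimal sketch of what such an ``application.conf`` can
look like. This is a hedged sketch: the system name ``ClusterSystem``, host, and ports are
illustrative, and the sample file included in step 1 is the authoritative version::

  akka {
    actor {
      provider = "akka.cluster.ClusterActorRefProvider"
    }
    remote {
      netty {
        hostname = "127.0.0.1"
        port = 0
      }
    }
    cluster {
      seed-nodes = [
        "akka://ClusterSystem@127.0.0.1:2551",
        "akka://ClusterSystem@127.0.0.1:2552"]
    }
  }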
Look at the source code of the program again. What it does is create an actor
and register it as a subscriber of certain cluster events. It gets notified with
a snapshot event, ``CurrentClusterState``, that holds the full state information of
the cluster. After that it receives events for changes that happen in the cluster.
Automatic vs. Manual Joining
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You may decide whether joining the cluster should be done automatically or manually.
By default it is automatic, and you need to define the seed nodes in the configuration
so that a new node has an initial contact point. When a new node is started it
sends a message to all seed nodes and then sends a join command to the one that
answers first. If none of the seed nodes reply (they might not be started yet)
it retries this procedure until it succeeds or is shut down.
There is one thing to be aware of regarding the seed node configured as the
first element in the ``seed-nodes`` configuration list.
The seed nodes can be started in any order and it is not necessary to have all
seed nodes running, but the first seed node must be started when initially
starting a cluster, otherwise the other seed-nodes will not become initialized
and no other node can join the cluster. Once more than two seed nodes have been
started it is no problem to shut down the first seed node. If it goes down it
must be manually joined to the cluster again.
Automatic joining of the first seed node is not possible; it would only join
itself. It is only the first seed node that has this restriction.
You can disable automatic joining in the configuration::

  akka.cluster.auto-join = off
Then you need to join manually, using :ref:`cluster_jmx` or :ref:`cluster_command_line`.
You can join any node in the cluster. It doesn't have to be configured as a
seed node. If you are not using auto-join there is no need to configure
seed nodes at all.
Joining can also be performed programmatically with ``Cluster(system).join``.
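For example, a minimal sketch of a programmatic join (the system name and the target
address are illustrative)::

  import akka.actor.{ ActorSystem, Address }
  import akka.cluster.Cluster

  val system = ActorSystem("ClusterSystem")
  // join the cluster node running at the given address
  Cluster(system).join(Address("akka", "ClusterSystem", "127.0.0.1", 2551))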
Automatic vs. Manual Downing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When a member is considered by the failure detector to be unreachable the
leader is not allowed to perform its duties, such as changing status of
new joining members to 'Up'. The status of the unreachable member must be
changed to 'Down'. This can be performed automatically or manually. By
default it must be done manually, using :ref:`cluster_jmx` or
:ref:`cluster_command_line`.
It can also be performed programmatically with ``Cluster(system).down``.
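For example, a minimal sketch of programmatic downing (the address of the unreachable
member is illustrative)::

  import akka.actor.{ ActorSystem, Address }
  import akka.cluster.Cluster

  val system = ActorSystem("ClusterSystem")
  // mark the unreachable member as Down so that the leader can resume its duties
  Cluster(system).down(Address("akka", "ClusterSystem", "127.0.0.1", 2552))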
You can enable automatic downing in the configuration::

  akka.cluster.auto-down = on
Be aware that using auto-down implies that two separate clusters will
automatically be formed in case of a network partition. That might be
desired by some applications but not by others.
Subscribe to Cluster Events
^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can subscribe to change notifications of the cluster membership by using
``Cluster(system).subscribe``. A snapshot of the full state,
``akka.cluster.ClusterEvent.CurrentClusterState``, is sent to the subscriber
as the first event, followed by events for incremental updates.
There are several types of change events; consult the API documentation
of the classes that extend ``akka.cluster.ClusterEvent.ClusterDomainEvent``
for details about the events.
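A minimal sketch of what such a subscriber can look like (hedged; the actor and its
event handling are illustrative, and the samples below contain complete code)::

  import akka.actor.Actor
  import akka.cluster.Cluster
  import akka.cluster.ClusterEvent._

  class SimpleClusterListener extends Actor {
    val cluster = Cluster(context.system)

    // subscribe when the actor starts, unsubscribe when it stops
    override def preStart(): Unit = cluster.subscribe(self, classOf[ClusterDomainEvent])
    override def postStop(): Unit = cluster.unsubscribe(self)

    def receive = {
      case state: CurrentClusterState =>
        // first event: a snapshot of the full cluster state
        println("Current members: " + state.members)
      case MemberUp(member) =>
        // incremental update: a member changed status to Up
        println("Member is Up: " + member)
      case _: ClusterDomainEvent => // ignore other incremental updates
    }
  }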
Worker Dial-in Example
----------------------
Let's take a look at an example that illustrates how workers, here named *backend*,
can detect and register with new master nodes, here named *frontend*.
The example application provides a service to transform text. When some text
is sent to one of the frontend services, it will be delegated to one of the
backend workers, which performs the transformation job, and sends the result back to
the original client. New backend nodes, as well as new frontend nodes, can be
added to or removed from the cluster dynamically.
In this example the following imports are used:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#imports
Messages:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#messages
The backend worker that performs the transformation job:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#backend
Note that the ``TransformationBackend`` actor subscribes to cluster events to detect new,
potential, frontend nodes, and sends them a registration message so that they know
that they can use the backend worker.
The frontend that receives user jobs and delegates to one of the registered backend workers:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/transformation/TransformationSample.scala#frontend
Note that the ``TransformationFrontend`` actor watches the registered backend workers,
to be able to remove them from its list of available backend workers.
Death watch uses the cluster failure detector for nodes in the cluster, i.e. it detects
network failures and JVM crashes, in addition to graceful termination of a watched
actor.
This example is included in ``akka-samples/akka-sample-cluster``
and you can try it out by starting nodes in different terminal windows. For example, starting 2
frontend nodes and 3 backend nodes::

  sbt
  project akka-sample-cluster-experimental
  run-main sample.cluster.transformation.TransformationFrontend 2551
  run-main sample.cluster.transformation.TransformationBackend 2552
  run-main sample.cluster.transformation.TransformationBackend
  run-main sample.cluster.transformation.TransformationBackend
  run-main sample.cluster.transformation.TransformationFrontend
.. note:: The above example should probably be designed as two separate frontend/backend clusters, once there is a `cluster client for decoupling clusters <https://www.assembla.com/spaces/akka/tickets/1165>`_.
Cluster Aware Routers
^^^^^^^^^^^^^^^^^^^^^
All :ref:`routers <routing-scala>` can be made aware of member nodes in the cluster, i.e.
deploying new routees or looking up routees on nodes in the cluster.
When a node becomes unavailable or leaves the cluster the routees of that node are
automatically unregistered from the router. When new nodes join the cluster additional
routees are added to the router, according to the configuration.
When using a router with routees looked up on the cluster member nodes, i.e. the routees
are already running, the configuration for a router looks like this:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/multi-jvm/scala/sample/cluster/stats/StatsSampleSpec.scala#router-lookup-config
It's the relative actor path defined in ``routees-path`` that identifies which actor to look up.
``nr-of-instances`` defines the total number of routees in the cluster, but there will not be
more than one per node. Setting ``nr-of-instances`` to a high value will result in new routees
being added to the router when nodes join the cluster.
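One possible shape of such a deployment configuration (a hedged sketch; the deployment
path, router type, and values are illustrative, and the config included above is the
authoritative version)::

  akka.actor.deployment {
    /statsService/workerRouter {
      router = consistent-hashing
      nr-of-instances = 100
      cluster {
        enabled = on
        routees-path = "/user/statsWorker"
        allow-local-routees = on
      }
    }
  }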
The same type of router could also have been defined in code:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#router-lookup-in-code
When using a router with routees created and deployed on the cluster member nodes
the configuration for a router looks like this:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/multi-jvm/scala/sample/cluster/stats/StatsSampleSingleMasterSpec.scala#router-deploy-config
``nr-of-instances`` defines the total number of routees in the cluster, but the number of routees
per node, ``max-nr-of-instances-per-node``, will not be exceeded. Setting ``nr-of-instances``
to a high value will result in additional routees being created and deployed when new nodes join
the cluster.
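A hedged sketch of what the corresponding deployment configuration can look like (again
the path and values are illustrative; the config included above is authoritative)::

  akka.actor.deployment {
    /singleton/statsService/workerRouter {
      router = consistent-hashing
      nr-of-instances = 100
      cluster {
        enabled = on
        max-nr-of-instances-per-node = 3
        allow-local-routees = off
      }
    }
  }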
The same type of router could also have been defined in code:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#router-deploy-in-code
See the :ref:`cluster_configuration` section for further descriptions of the settings.
Router Example
--------------
Let's take a look at how to use cluster aware routers.
The example application provides a service to calculate statistics for a text.
When some text is sent to the service it splits it into words, and delegates the task
of counting the number of characters in each word to a separate worker, a routee of a router.
The character count for each word is sent back to an aggregator that calculates
the average number of characters per word when all results have been collected.
In this example we use the following imports:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#imports
Messages:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#messages
The worker that counts the number of characters in each word:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#worker
The service that receives text from users and splits it up into words, delegates to workers and aggregates:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#service
Note, nothing cluster specific so far, just plain actors.
We can use these actors with two different types of router setup: either with lookup of routees,
or with creation and deployment of routees. Remember, the routees are the workers in this case.
We start with the router setup with lookup of routees. All nodes start ``StatsService`` and
``StatsWorker`` actors and the router is configured with ``routees-path``:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#start-router-lookup
This means that user requests can be sent to ``StatsService`` on any node and it will use
``StatsWorker`` on all nodes. There can only be one worker per node, but that worker could easily
fan out to local children if more parallelism is needed.
This example is included in ``akka-samples/akka-sample-cluster``
and you can try it out by starting nodes in different terminal windows. For example, starting 3
service nodes and 1 client::

  run-main sample.cluster.stats.StatsSample 2551
  run-main sample.cluster.stats.StatsSample 2552
  run-main sample.cluster.stats.StatsSampleClient
  run-main sample.cluster.stats.StatsSample
The above setup is nice for this example, but we will also take a look at how to use
a single master node that creates and deploys workers. To keep track of a single
master we need one additional actor:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#facade
The ``StatsFacade`` receives text from users and delegates to the current ``StatsService``, the single
master. It listens to cluster events to create or look up the ``StatsService`` depending on whether
it is on the same node or on another node. We run the master on the same node as the leader of
the cluster members, which is nothing more than the address currently sorted first in the member ring,
i.e. it can change when new nodes join or when the current leader leaves.
All nodes start ``StatsFacade`` and the router is now configured like this:
.. includecode:: ../../../akka-samples/akka-sample-cluster/src/main/scala/sample/cluster/stats/StatsSample.scala#start-router-deploy
This example is included in ``akka-samples/akka-sample-cluster``
and you can try it out by starting nodes in different terminal windows. For example, starting 3
service nodes and 1 client::

  run-main sample.cluster.stats.StatsSampleOneMaster 2551
  run-main sample.cluster.stats.StatsSampleOneMaster 2552
  run-main sample.cluster.stats.StatsSampleOneMasterClient
  run-main sample.cluster.stats.StatsSampleOneMaster
.. note:: The above example, especially the last part, will be simplified when the cluster handles automatic actor partitioning.
.. _cluster_jmx:
JMX
^^^
Information and management of the cluster is available as JMX MBeans with the root name ``akka.Cluster``.
The JMX information can be displayed with an ordinary JMX console such as JConsole or JVisualVM.
From JMX you can:

* see which members are part of the cluster
* see the status of this node
* join this node to another node in the cluster
* mark any node in the cluster as down
* tell any node in the cluster to leave

Member nodes are identified by their address, in the format `akka://actor-system-name@hostname:port`.
.. _cluster_command_line:
Command Line Management
^^^^^^^^^^^^^^^^^^^^^^^
The cluster can be managed with the script `bin/akka-cluster` provided in the
Akka distribution.
Run it without parameters to see instructions about how to use the script::
  Usage: bin/akka-cluster <node-hostname:jmx-port> <command> ...

  Supported commands are:
    join <node-url>   - Sends a request to JOIN the node with the specified URL
    leave <node-url>  - Sends a request for the node with the specified URL to LEAVE the cluster
    down <node-url>   - Sends a request for marking the node with the specified URL as DOWN
    member-status     - Asks the member node for its current status
    cluster-status    - Asks the cluster for its current status (member ring,
                        unavailable nodes, meta data etc.)
    leader            - Asks the cluster who the current leader is
    is-singleton      - Checks if the cluster is a singleton cluster (single
                        node cluster)
    is-available      - Checks if the member node is available
    is-running        - Checks if the member node is running
    has-convergence   - Checks if there is a cluster convergence

  Where <node-url> should be in the format 'akka://actor-system-name@hostname:port'

  Examples: bin/akka-cluster localhost:9999 is-available
            bin/akka-cluster localhost:9999 join akka://MySystem@darkstar:2552
            bin/akka-cluster localhost:9999 cluster-status
To be able to use the script you must enable remote monitoring and management when starting the JVMs of the cluster nodes,
as described in `Monitoring and Management Using JMX Technology <http://docs.oracle.com/javase/6/docs/technotes/guides/management/agent.html>`_.

Example of system properties to enable remote monitoring and management::

  java -Dcom.sun.management.jmxremote.port=9999 \
       -Dcom.sun.management.jmxremote.authenticate=false \
       -Dcom.sun.management.jmxremote.ssl=false
.. _cluster_configuration:
Configuration
^^^^^^^^^^^^^
There are several configuration properties for the cluster. We refer to the following
reference file for more information:
.. literalinclude:: ../../../akka-cluster/src/main/resources/reference.conf
   :language: none
Cluster Scheduler
-----------------
When using the cluster it is recommended that you change the ``tick-duration``
of the default scheduler to 33 ms or less, unless you need it configured to
a longer duration for other reasons. If you don't do this, a dedicated
scheduler will be used for the periodic tasks of the cluster, which
introduces the extra overhead of another thread.

::

  # shorter tick-duration of default scheduler when using cluster
  akka.scheduler.tick-duration = 33ms


@@ -0,0 +1,644 @@
.. _cluster:
######################
Cluster Specification
######################
.. note:: This module is :ref:`experimental <experimental>`. This document describes the design concepts of the new clustering coming in Akka Coltrane. Not everything described here is implemented yet.
Intro
=====
Akka Cluster provides a fault-tolerant, elastic, decentralized peer-to-peer
cluster with no single point of failure (SPOF) or single point of bottleneck
(SPOB). It implements a Dynamo-style system using gossip protocols, automatic
failure detection, automatic partitioning, handoff, and cluster rebalancing, but
with some differences due to the fact that it is not just managing passive data,
but actors: active, sometimes stateful, components that also have requirements
on message ordering, the number of active instances in the cluster, etc.
Terms
=====
These terms are used throughout the documentation.
**node**
A logical member of a cluster. There could be multiple nodes on a physical
machine. Defined by a `hostname:port` tuple.
**cluster**
A set of nodes. Contains distributed Akka applications.
**partition**
An actor or subtree of actors in the Akka application that is distributed
within the cluster.
**partition point**
The actor at the head of a partition. The point around which a partition is
formed.
**partition path**
Also referred to as the actor address. Has the format `actor1/actor2/actor3`.
**instance count**
The number of instances of a partition in the cluster. Also referred to as the
``N-value`` of the partition.
**instance node**
A node that an actor instance is assigned to.
**partition table**
A mapping from partition path to a set of instance nodes (where the nodes are
referred to by their ordinal position, given the nodes in sorted order).
**leader**
A single node in the cluster that acts as the leader, managing cluster convergence,
partitions, fail-over, rebalancing etc.
Membership
==========
A cluster is made up of a set of member nodes. The identifier for each node is a
``hostname:port`` pair. An Akka application is distributed over a cluster with
each node hosting some part of the application. Cluster membership and
partitioning of the application are decoupled. A node could be a member of a
cluster without hosting any actors.
Singleton Cluster
-----------------
If a node does not have a preconfigured contact point to join in the Akka
configuration, then it is considered a singleton cluster (single node cluster)
and will automatically transition from ``joining`` to ``up``. Singleton clusters
can later explicitly send a ``Join`` message to another node to form an N-node
cluster. It is also possible to link multiple N-node clusters by ``joining`` them.
Gossip
------
The cluster membership used in Akka is based on Amazon's `Dynamo`_ system and
particularly the approach taken in Basho's `Riak`_ distributed database.
Cluster membership is communicated using a `Gossip Protocol`_, where the current
state of the cluster is gossiped randomly through the cluster. Joining a cluster
is initiated by issuing a ``Join`` command to one of the nodes in the cluster to
join.
.. _Gossip Protocol: http://en.wikipedia.org/wiki/Gossip_protocol
.. _Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
.. _Riak: http://basho.com/technology/architecture/
Vector Clocks
^^^^^^^^^^^^^
`Vector clocks`_ are an algorithm for generating a partial ordering of events in
a distributed system and detecting causality violations.
We use vector clocks to reconcile and merge differences in cluster state
during gossiping. A vector clock is a set of (node, counter) pairs. Each update
to the cluster state has an accompanying update to the vector clock.
One problem with vector clocks is that their history can become very long over
time, which both makes comparisons take longer and takes up unnecessary memory.
To solve that problem we prune the vector clocks according to
the `pruning algorithm`_ in Riak.
.. _Vector Clocks: http://en.wikipedia.org/wiki/Vector_clock
.. _pruning algorithm: http://wiki.basho.com/Vector-Clocks.html#Vector-Clock-Pruning
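To make this concrete, here is a minimal sketch of the comparison and merge
operations of a vector clock (illustrative only, not the actual Akka implementation)::

  object VectorClockSketch {
    type Node = String

    case class VectorClock(versions: Map[Node, Long] = Map.empty) {
      // record an update made by `node`
      def +(node: Node): VectorClock =
        copy(versions = versions.updated(node, versions.getOrElse(node, 0L) + 1L))

      // true if every entry here is <= the corresponding entry in `other`,
      // i.e. this state causally precedes (or equals) the other state
      def <=(other: VectorClock): Boolean =
        versions.forall { case (n, v) => v <= other.versions.getOrElse(n, 0L) }

      // pairwise maximum, used when reconciling divergent gossiped states
      def merge(other: VectorClock): VectorClock =
        VectorClock((versions.keySet ++ other.versions.keySet).map { n =>
          n -> math.max(versions.getOrElse(n, 0L), other.versions.getOrElse(n, 0L))
        }.toMap)
    }
  }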
Gossip Convergence
^^^^^^^^^^^^^^^^^^
Information about the cluster converges at certain points of time. This is when
all nodes have seen the same cluster state. Convergence is recognised by passing
a map from node to current state version during gossip. This information is
referred to as the gossip overview. When all versions in the overview are equal
there is convergence. Gossip convergence cannot occur while any nodes are
unreachable; either the nodes must become reachable again, or they must be
moved into the ``down`` or ``removed`` states (see the section on `Member states`_
below).
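A minimal sketch of the convergence check (illustrative only)::

  // convergence: no unreachable nodes, and every node has seen the same state version
  def converged[Node, Version](versions: Map[Node, Version], unreachable: Set[Node]): Boolean =
    unreachable.isEmpty && versions.values.toSet.size <= 1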
Failure Detector
^^^^^^^^^^^^^^^^
The failure detector is responsible for trying to detect if a node is
unreachable from the rest of the cluster. For this we are using an
implementation of `The Phi Accrual Failure Detector`_ by Hayashibara et al.
An accrual failure detector decouples monitoring and interpretation. That makes
accrual failure detectors applicable to a wider range of scenarios and more
adequate for building generic failure detection services. The idea is to keep a
history of failure statistics, calculated from heartbeats received from other
nodes, and to take multiple factors, and how they accumulate over time, into
account in order to come up with a better guess about whether a specific node
is up or down. Rather than just answering "yes" or "no" to the
question "is the node down?" it returns a ``phi`` value representing the
likelihood that the node is down.
The ``threshold`` that is the basis for the calculation is configurable by the
user. A low ``threshold`` is prone to generate many wrong suspicions but ensures
a quick detection in the event of a real crash. Conversely, a high ``threshold``
generates fewer mistakes but needs more time to detect actual crashes. The
default ``threshold`` is 8 and is appropriate for most situations. However in
cloud environments, such as Amazon EC2, the value could be increased to 12 in
order to account for network issues that sometimes occur on such platforms.
.. _The Phi Accrual Failure Detector: http://ddg.jaist.ac.jp/pub/HDY+04.pdf
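The following sketch conveys the idea of the calculation, assuming normally
distributed heartbeat inter-arrival times and using a logistic approximation of
the normal CDF (illustrative only, not the actual Akka implementation)::

  // phi = -log10(probability that a heartbeat arrives even later than now),
  // where mean and stdDev are estimated from the observed heartbeat history
  def phi(timeSinceLastHeartbeat: Double, mean: Double, stdDev: Double): Double = {
    val y = (timeSinceLastHeartbeat - mean) / stdDev
    val cdf = 1.0 / (1.0 + math.exp(-y * (1.5976 + 0.070566 * y * y)))
    -math.log10(1.0 - cdf)
  }

The longer the time since the last heartbeat, relative to the observed history,
the larger phi grows; the node is suspected when phi exceeds the configured
``threshold``.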
Leader
^^^^^^
After gossip convergence a ``leader`` for the cluster can be determined. There is no
``leader`` election process; the ``leader`` can always be recognised deterministically
by any node whenever there is gossip convergence. The ``leader`` is simply the first
node in sorted order that is able to take the leadership role, where the only
allowed member states for a ``leader`` are ``up``, ``leaving`` or ``exiting`` (see
below for more information about member states).
The role of the ``leader`` is to shift members in and out of the cluster, changing
``joining`` members to the ``up`` state or ``exiting`` members to the
``removed`` state, and to schedule rebalancing across the cluster. Currently
``leader`` actions are only triggered by receiving a new cluster state with gossip
convergence but it may also be possible for the user to explicitly rebalance the
cluster by specifying migrations, or to rebalance the cluster automatically
based on metrics from member nodes. Metrics may be spread using the gossip
protocol or possibly more efficiently using a *random chord* method, where the
``leader`` contacts several random nodes around the cluster ring and each contacted
node gathers information from their immediate neighbours, giving a random
sampling of load information.
The ``leader`` also has the power, if configured so, to "auto-down" a node that
according to the Failure Detector is considered unreachable. This means setting
the unreachable node status to ``down`` automatically.
Seed Nodes
^^^^^^^^^^
The seed nodes are configured contact points for the initial join of the cluster.
When a new node is started it sends a message to all seed nodes and
then sends a join command to the one that answers first.
It is possible to turn off automatic join.
Gossip Protocol
^^^^^^^^^^^^^^^
A variation of *push-pull gossip* is used to reduce the amount of gossip
information sent around the cluster. In push-pull gossip a digest is sent
representing current versions but not actual values; the recipient of the gossip
can then send back any values for which it has newer versions and also request
values for which it has outdated versions. Akka uses a single shared state with
a vector clock for versioning, so the variant of push-pull gossip used in Akka
makes use of the gossip overview (containing the current state versions for all
nodes) to only push the actual state as needed. This also allows any node to
easily determine which other nodes have newer or older information, not just the
nodes involved in a gossip exchange.
Periodically, by default every second, each node chooses another random
node to initiate a round of gossip with. The choice of node is random but can
also include extra gossiping nodes with either newer or older state versions.
The gossip overview contains the current state version for all nodes and also a
list of unreachable nodes. Whenever a node receives a gossip overview it updates
the `Failure Detector`_ with the liveness information.
The nodes defined as ``seed`` nodes are just regular member nodes whose only
"special role" is to function as contact points in the cluster.
During each round of gossip exchange a node gossips, with some probability, to a
random node with newer or older state information, if any, based on the current
gossip overview. Otherwise it gossips to any random live node.
The gossiper only sends the gossip overview to the chosen node. The recipient of
the gossip can use the gossip overview to determine whether:
1. it has a newer version of the gossip state, in which case it sends that back
to the gossiper, or
2. it has an outdated version of the state, in which case the recipient requests
the current state from the gossiper
If the recipient has the same version as the gossip overview then the gossip
state is not sent or requested.
The main structures used in gossiping are the gossip overview and the gossip
state::
  GossipOverview {
    versions: Map[Node, VectorClock],
    unreachable: Set[Node]
  }

  GossipState {
    version: VectorClock,
    members: SortedSet[Member],
    partitions: Tree[PartitionPath, Node],
    pending: Set[PartitionChange],
    meta: Option[Map[String, Array[Byte]]]
  }
Some of the other structures used are::
  Node = InetSocketAddress

  Member {
    node: Node,
    state: MemberState
  }

  MemberState = Joining | Up | Leaving | Exiting | Down | Removed

  PartitionChange {
    from: Node,
    to: Node,
    path: PartitionPath,
    status: PartitionChangeStatus
  }

  PartitionChangeStatus = Awaiting | Complete
Membership Lifecycle
--------------------
A node begins in the ``joining`` state. Once all nodes have seen that the new
node is joining (through gossip convergence) the ``leader`` will set the member
state to ``up`` and can start assigning partitions to the new node.
If a node is leaving the cluster in a safe, expected manner then it switches to
the ``leaving`` state. The ``leader`` will reassign partitions across the cluster
(it is possible for a leaving node to itself be the ``leader``). When all partition
handoff has completed then the node will change to the ``exiting`` state. Once
all nodes have seen the exiting state (convergence) the ``leader`` will remove the
node from the cluster, marking it as ``removed``.
If a node is unreachable then gossip convergence is not possible and therefore
any ``leader`` actions are also not possible (for instance, allowing a node to
become a part of the cluster, or changing actor distribution). To be able to
move forward the state of the unreachable nodes must be changed. If the
unreachable node is experiencing only transient difficulties then it can be
explicitly marked as ``down`` using the ``down`` user action. When this node
comes back up and begins gossiping it will automatically go through the joining
process again. If the unreachable node will be permanently down then it can be
removed from the cluster directly by shutting the actor system down or killing it
through an external ``SIGKILL`` signal, invocation of ``System.exit(status)`` or
similar. The cluster can, through the leader, also *auto-down* a node.
This means that nodes can join and leave the cluster at any point in time, i.e.
provide cluster elasticity.
State Diagram for the Member States
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. image:: images/member-states.png
Member States
^^^^^^^^^^^^^
- **joining**
transient state when joining a cluster
- **up**
normal operating state
- **leaving** / **exiting**
states during graceful removal
- **down**
marked as down/offline/unreachable
- **removed**
tombstone state (no longer a member)
User Actions
^^^^^^^^^^^^
- **join**
join a single node to a cluster - can be explicit or automatic on
startup if a node to join has been specified in the configuration
- **leave**
tell a node to leave the cluster gracefully
- **down**
mark a node as temporarily down
Leader Actions
^^^^^^^^^^^^^^
The ``leader`` has the following duties:
- shifting members in and out of the cluster
- joining -> up
- exiting -> removed
- partition distribution
- scheduling handoffs (pending changes)
- setting the partition table (partition path -> base node)
- automatic rebalancing based on runtime metrics in the system (such as CPU,
  RAM, garbage collection, mailbox depth etc.)
Partitioning
============
Each partition (an actor or actor subtree) in the actor system is assigned to a
set of nodes in the cluster. The actor at the head of the partition is referred
to as the partition point. The mapping from partition path (actor address of the
format "a/b/c") to instance nodes is stored in the partition table and is
maintained as part of the cluster state through the gossip protocol. The
partition table is only updated by the ``leader`` node. Currently the only possible
partition points are *routed* actors.
Routed actors can have an instance count greater than one. The instance count is
also referred to as the ``N-value``. If the ``N-value`` is greater than one then
a set of instance nodes will be given in the partition table.
Note that in the first implementation there may be a restriction such that only
top-level partitions are possible (the highest possible partition points are
used and sub-partitioning is not allowed). This is still to be explored in more detail.
The cluster ``leader`` determines the current instance count for a partition based
on two axes: fault-tolerance and scaling.
Fault-tolerance determines a minimum number of instances for a routed actor
(allowing N-1 nodes to crash while still maintaining at least one running actor
instance). The user can specify a function from the current number of nodes to the
number of acceptable node failures: ``n: Int => f: Int`` where ``f < n``.
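For example, such a function could look like this (illustrative only)::

  // tolerate roughly half of the nodes failing, but always fewer than n
  val acceptableFailures: Int => Int = n => math.max(0, (n - 1) / 2)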
Scaling reflects the number of instances needed to maintain good throughput and
is influenced by metrics from the system, particularly a history of mailbox
size, CPU load, and GC percentages. It may also be possible to accept scaling
hints from the user that indicate expected load.
The balancing of partitions can be determined in a very simple way in the first
implementation, where the overlap of partitions is minimized. Partitions are
spread over the cluster ring in a circular fashion, with each instance node in
the first available space. For example, given a cluster with ten nodes and three
partitions, A, B, and C, having N-values of 4, 3, and 5; partition A would have
instances on nodes 1-4; partition B would have instances on nodes 5-7; partition
C would have instances on nodes 8-10 and 1-2. The only overlap is on nodes 1 and
2.
The distribution of partitions is not limited, however, to having instances on
adjacent nodes in the sorted ring order. Each instance can be assigned to any
node and the more advanced load balancing algorithms will make use of this. The
partition table contains a mapping from path to instance nodes. The partitioning
for the above example would be::
  A -> { 1, 2, 3, 4 }
  B -> { 5, 6, 7 }
  C -> { 8, 9, 10, 1, 2 }
If 5 new nodes join the cluster and in sorted order these nodes appear after the
current nodes 2, 4, 5, 7, and 8, then the partition table could be updated to
the following, with all instances on the same physical nodes as before::
  A -> { 1, 2, 4, 5 }
  B -> { 7, 9, 10 }
  C -> { 12, 14, 15, 1, 2 }
When rebalancing is required the ``leader`` will schedule handoffs, gossiping a set
of pending changes, and when each change is complete the ``leader`` will update the
partition table.
Handoff
-------
Handoff for an actor-based system is different from that for a data-based system. The
most important point is that message ordering (from a given node to a given
actor instance) may need to be maintained. If an actor is a singleton actor
(only one instance possible throughout the cluster) then the cluster may also
need to assure that there is only one such actor active at any one time. Both of
these situations can be handled by forwarding and buffering messages during
transitions.
A *graceful handoff* (one where the previous host node is up and running during
the handoff), given a previous host node ``N1``, a new host node ``N2``, and an
actor partition ``A`` to be migrated from ``N1`` to ``N2``, has this general
structure:
1. the ``leader`` sets a pending change for ``N1`` to handoff ``A`` to ``N2``
2. ``N1`` notices the pending change and sends an initialization message to ``N2``
3. in response ``N2`` creates ``A`` and sends back a ready message
4. after receiving the ready message ``N1`` marks the change as
complete and shuts down ``A``
5. the ``leader`` sees the migration is complete and updates the partition table
6. all nodes eventually see the new partitioning and use ``N2``
Transitions
^^^^^^^^^^^
There are transition times in the handoff process where different approaches can
be used to give different guarantees.
Migration Transition
~~~~~~~~~~~~~~~~~~~~
The first transition starts when ``N1`` initiates the moving of ``A`` and ends
when ``N1`` receives the ready message, and is referred to as the *migration
transition*.
The first question is: during the migration transition, should

- ``N1`` continue to process messages for ``A``, or
- is it important that no messages for ``A`` are processed on
  ``N1`` once migration begins?
If it is okay for the previous host node ``N1`` to process messages during
migration then there is nothing that needs to be done at this point.
If no messages are to be processed on the previous host node during migration
then there are two possibilities: the messages are forwarded to the new host and
buffered until the actor is ready, or the messages are simply dropped by
terminating the actor and allowing the normal dead letter process to be used.
Update Transition
~~~~~~~~~~~~~~~~~
The second transition begins when the migration is marked as complete and ends
when all nodes have the updated partition table (when all nodes will use ``N2``
as the host for ``A``, i.e. we have convergence) and is referred to as the
*update transition*.
Once the update transition begins ``N1`` can forward any messages it receives
for ``A`` to the new host ``N2``. The question is whether or not message
ordering needs to be preserved. If messages sent to the previous host node
``N1`` are being forwarded, then it is possible that a message sent to ``N1``
could be forwarded after a direct message to the new host ``N2``, breaking
message ordering from a client to actor ``A``.
In this situation ``N2`` can keep a buffer for messages per sending node. Each
buffer is flushed and removed when an acknowledgement (``ack``) message has been
received. When each node in the cluster sees the partition update it first sends
an ``ack`` message to the previous host node ``N1`` before beginning to use
``N2`` as the new host for ``A``. Any messages sent from the client node
directly to ``N2`` will be buffered. ``N1`` can count down the number of acks to
determine when no more forwarding is needed. The ``ack`` message from any node
will always follow any other messages sent to ``N1``. When ``N1`` receives the
``ack`` message it also forwards it to ``N2`` and again this ``ack`` message
will follow any other messages already forwarded for ``A``. When ``N2`` receives
an ``ack`` message, the buffer for the sending node can be flushed and removed.
Any subsequent messages from this sending node can be queued normally. Once all
nodes in the cluster have acknowledged the partition change and ``N2`` has
cleared all buffers, the handoff is complete and message ordering has been
preserved. In practice the buffers should remain small as it is only those
messages sent directly to ``N2`` before the acknowledgement has been forwarded
that will be buffered.
Graceful Handoff
^^^^^^^^^^^^^^^^
A more complete process for graceful handoff would be:
1. the ``leader`` sets a pending change for ``N1`` to handoff ``A`` to ``N2``
2. ``N1`` notices the pending change and sends an initialization message to
``N2``. Options:
a. keep ``A`` on ``N1`` active and continue processing messages as normal
b. ``N1`` forwards all messages for ``A`` to ``N2``
c. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
becoming dead letters)
3. in response ``N2`` creates ``A`` and sends back a ready message. Options:
a. ``N2`` simply processes messages for ``A`` as normal
b. ``N2`` creates a buffer per sending node for ``A``. Each buffer is
opened (flushed and removed) when an acknowledgement for the sending
node has been received (via ``N1``)
4. after receiving the ready message ``N1`` marks the change as complete. Options:
a. ``N1`` forwards all messages for ``A`` to ``N2`` during the update transition
b. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
becoming dead letters)
5. the ``leader`` sees the migration is complete and updates the partition table
6. all nodes eventually see the new partitioning and use ``N2``
i. each node sends an acknowledgement message to ``N1``
ii. when ``N1`` receives the acknowledgement it can count down the pending
acknowledgements and remove forwarding when complete
iii. when ``N2`` receives the acknowledgement it can open the buffer for the
sending node (if buffers are used)
The default approach is to take options 2a, 3a, and 4a - allowing ``A`` on
``N1`` to continue processing messages during migration and then forwarding any
messages during the update transition. This assumes stateless actors that do not
have a dependency on message ordering from any given source.
- If an actor has a distributed durable mailbox then nothing needs to be done,
other than migrating the actor.
- If message ordering needs to be maintained during the update transition then
option 3b can be used, creating buffers per sending node.
- If the actors are robust to message send failures then the dropping messages
approach can be used (with no forwarding or buffering needed).
- If an actor is a singleton (only one instance possible throughout the cluster)
and state is transferred during the migration initialization, then options 2b
and 3b would be required.
Stateful Actor Replication
==========================
Support for stateful singleton actors will come in future releases of Akka, and
is scheduled for Akka 2.2. Since we already have a Dynamo base for the clustering,
we should use the same infrastructure to provide stateful actor clustering and a
datastore as well. The stateful actor clustering should be layered on top of the
distributed datastore. See the next section for a rough outline of how the
distributed datastore could be implemented.
Implementing a Dynamo-style Distributed Database on top of Akka Cluster
-----------------------------------------------------------------------
The missing pieces to implement a full Dynamo-style eventually consistent data
storage on top of the Akka Cluster as described in this document are:
- Configuration of ``READ`` and ``WRITE`` consistency levels according to the
  ``N/R/W`` numbers defined in the Dynamo paper (a worked example follows this list):

  - R = read replica count
  - W = write replica count
  - N = replication factor
  - Q = QUORUM = N / 2 + 1
  - W + R > N gives full consistency
- Define a versioned data message wrapper::
    Versioned[T](hash: Long, version: VectorClock, data: T)
- Define a single system data broker actor on each node that uses a ``Consistent
  Hashing Router`` and that has instances on all other nodes in the node ring.
- For ``WRITE``:
1. Wrap data in a ``Versioned Message``
2. Send the ``Versioned Message`` with the data to a number of nodes
   matching the ``W-value``.
- For ``READ``:
1. Read in the ``Versioned Message`` with the data from as many replicas as
you need for the consistency level required by the ``R-value``.
2. Compare the versions (using `Vector Clocks`_).
3. If the versions differ then do `Read Repair`_ to update the inconsistent
nodes.
4. Return the latest versioned data.
.. _Read Repair: http://wiki.apache.org/cassandra/ReadRepair
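As the worked example promised above: with replication factor ``N = 3`` the quorum is
``Q = 3/2 + 1 = 2`` (integer division). Choosing ``W = 2`` and ``R = 2`` gives
``W + R = 4 > N = 3``, so every read quorum overlaps every write quorum in at least
one replica, and a read is guaranteed to observe the latest successful write.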

(Two binary image files added, 38 KiB and 1.5 KiB; not shown.)


@@ -0,0 +1,8 @@
Cluster
=======
.. toctree::
   :maxdepth: 2

   cluster
   cluster-usage