Some docs for new clustering

Peter Vlugter 2011-10-25 16:28:12 +02:00
parent 173ef048ce
commit 80250cd884
3 changed files with 365 additions and 0 deletions

akka-docs/cluster/new.rst

@@ -0,0 +1,365 @@
.. _cluster:
################
New Clustering
################
Membership
==========
A cluster is made up of a set of member nodes. The identifier for each node is a
host:port pair. An Akka application is distributed over a cluster with each node
hosting some part of the application. Cluster membership and partitioning of the
application are decoupled. A node could be a member of a cluster without hosting
any actors.
Gossip
------
The cluster membership used in Akka is based on the Dynamo system and
particularly the approach taken in Riak. Cluster membership is communicated
using a gossip protocol. The current state of the cluster is gossiped randomly
through the cluster. Joining a cluster is initiated by specifying a set of seed
nodes with which to begin gossiping.
TODO: More details about the gossip approach (push-pull-gossip?). Gossiping to a
random member node, random unreachable node, random seed node.
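
As a rough, illustrative sketch of the target selection mentioned in the TODO above (the ``Node`` type, the selection weights, and the method names are assumptions, not the implemented behaviour):

.. code-block:: scala

   import scala.util.Random

   // Hypothetical node identifier: the host:port pair used as the member id.
   final case class Node(host: String, port: Int)

   object GossipTarget {

     private def randomFrom(nodes: Set[Node]): Option[Node] =
       if (nodes.isEmpty) None
       else Some(nodes.toVector(Random.nextInt(nodes.size)))

     // Usually gossip to a random member, but occasionally to a random
     // unreachable node or a random seed node so that state still spreads there.
     def select(members: Set[Node], unreachable: Set[Node], seeds: Set[Node]): Option[Node] =
       Random.nextInt(10) match {
         case 0 => randomFrom(unreachable) orElse randomFrom(members)
         case 1 => randomFrom(seeds) orElse randomFrom(members)
         case _ => randomFrom(members)
       }
   }
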
Vector Clocks
-------------
Vector clocks are used to reconcile and merge differences in cluster state
during gossiping. A vector clock is a set of (node, counter) pairs. Each update
to the cluster state has an accompanying update to the vector clock.
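
For illustration, a vector clock over node identifiers might look like the following sketch (the representation and method names are assumptions, not the actual implementation):

.. code-block:: scala

   // Hypothetical node identifier (host:port pair).
   final case class Node(host: String, port: Int)

   // A vector clock as a map from node to counter.
   final case class VectorClock(versions: Map[Node, Long] = Map.empty) {

     // Record an update made by `node`.
     def increment(node: Node): VectorClock =
       copy(versions = versions.updated(node, versions.getOrElse(node, 0L) + 1L))

     // True if this clock is the same as or older than `other` for every node.
     def isBefore(other: VectorClock): Boolean =
       versions.forall { case (node, counter) => counter <= other.versions.getOrElse(node, 0L) }

     // Merge two clocks by taking the maximum counter per node; used when
     // reconciling two divergent versions of the cluster state.
     def merge(other: VectorClock): VectorClock =
       VectorClock((versions.keySet ++ other.versions.keySet).map { node =>
         node -> math.max(versions.getOrElse(node, 0L), other.versions.getOrElse(node, 0L))
       }.toMap)
   }
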
Gossip convergence
------------------
Information about the cluster converges at certain points in time. This is when
all nodes have seen the same cluster state. To be able to recognise this
convergence, a map from node to current vector clock is also passed as part of
the gossip state. Gossip convergence cannot occur while any nodes are
unreachable; either the nodes must become reachable again, or they must be
moved into the ``down`` or ``removed`` states (see below).
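
A minimal sketch of this convergence check (the ``Gossip`` fields and names below are assumptions for illustration): the gossip state carries, per node, the vector clock version that node has last seen, and convergence holds when there are no unreachable nodes and every member has seen the current version.

.. code-block:: scala

   final case class Node(host: String, port: Int)
   final case class VectorClock(versions: Map[Node, Long] = Map.empty)

   final case class Gossip(
       members: Set[Node],              // current member nodes
       unreachable: Set[Node],          // nodes currently marked unreachable
       version: VectorClock,            // version of this cluster state
       seen: Map[Node, VectorClock]) {  // latest version each node has seen

     // Convergence: no unreachable nodes, and every member has seen
     // exactly the current version of the cluster state.
     def convergence: Boolean =
       unreachable.isEmpty &&
         members.forall(node => seen.get(node).exists(_ == version))
   }
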
Leader
------
After gossip convergence a leader for the cluster can be determined. There is no
leader election process; the leader can always be recognised deterministically
by any node whenever there is gossip convergence. The leader is simply the first
node in sorted order that is able to take the leadership role, where the only
allowed member states for a leader are ``up`` or ``leaving``.
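
A sketch of this deterministic rule (the member representation, status names, and the ordering by host and port are assumptions for illustration):

.. code-block:: scala

   // Hypothetical member representation.
   sealed trait MemberStatus
   object MemberStatus {
     case object Joining extends MemberStatus
     case object Up      extends MemberStatus
     case object Leaving extends MemberStatus
     case object Exiting extends MemberStatus
     case object Down    extends MemberStatus
     case object Removed extends MemberStatus
   }

   final case class Member(host: String, port: Int, status: MemberStatus)

   object Leader {
     // Sort members deterministically by their node identifier.
     implicit val memberOrdering: Ordering[Member] =
       Ordering.by(m => (m.host, m.port))

     // The leader is the first member, in sorted order, whose status
     // allows it to take the leadership role (up or leaving).
     def leader(members: Set[Member]): Option[Member] =
       members.toSeq.sorted.find(m =>
         m.status == MemberStatus.Up || m.status == MemberStatus.Leaving)
   }
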
The role of the leader is to shift members in and out of the cluster, changing
``joining`` members to the ``up`` state or ``exiting`` members to the
``removed`` state, and to schedule rebalancing across the cluster. Currently,
leader actions are only triggered by receiving a new cluster state with gossip
convergence, but it may also be possible for the user to explicitly rebalance
the cluster by specifying migrations, or to rebalance the cluster automatically
based on metrics gossiped by the member nodes.
Membership lifecycle
--------------------
A node begins in the ``joining`` state. Once all nodes have seen that the new
node is joining (through gossip convergence) the leader will set the member
state to ``up`` and can start assigning partitions to the new node.
If a node is leaving the cluster in a safe, expected manner then it switches to
the ``leaving`` state. The leader will reassign partitions across the cluster
(it is possible for a leaving node to itself be the leader). When all partition
handoff has completed then the node will change to the ``exiting`` state. Once
all nodes have seen the exiting state (convergence) the leader will remove the
node from the cluster, marking it as ``removed``.
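
These two member-state shifts performed by the leader could be sketched as follows, with illustrative types and names (not the actual implementation):

.. code-block:: scala

   sealed trait MemberStatus
   case object Joining extends MemberStatus
   case object Up      extends MemberStatus
   case object Exiting extends MemberStatus
   case object Removed extends MemberStatus

   final case class Member(host: String, port: Int, status: MemberStatus)

   object LeaderActions {
     // Run by the leader only when gossip convergence has been reached:
     // joining -> up      (all nodes have seen the new node joining)
     // exiting -> removed (handoff has completed and all nodes have seen it)
     def apply(members: Set[Member]): Set[Member] =
       members.map { m =>
         m.status match {
           case Joining => m.copy(status = Up)
           case Exiting => m.copy(status = Removed)
           case _       => m
         }
       }
   }
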
A node can also be removed forcefully by moving it directly to the ``removed``
state using the ``remove`` action. The cluster will rebalance based on the new
cluster membership.
If a node is unreachable then gossip convergence is not possible and therefore
any leader actions are also not possible (for instance, allowing a node to
become a part of the cluster, or changing actor distribution). To be able to
move forward the state of the unreachable nodes must be changed. If the
unreachable node is experiencing only transient difficulties then it can be
explicitly marked as ``down`` using the ``down`` user action. When this node
comes back up and begins gossiping it will automatically go through the joining
process again. If the unreachable node will be permanently down then it can be
removed from the cluster directly with the ``remove`` user action. The cluster
can also *auto-down* a node using the accrual failure detector.
TODO: more information about the accrual failure detection and auto-downing
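
As a rough sketch of the accrual (phi) failure detector idea only, not the project's implementation: the detector records heartbeat inter-arrival times and converts the time since the last heartbeat into a suspicion value phi; auto-downing would be triggered once phi exceeds a configured threshold. The class, the normal-distribution assumption, and the default threshold below are all illustrative.

.. code-block:: scala

   // Illustrative sketch of a phi accrual failure detector.
   final class PhiAccrualDetector(threshold: Double = 8.0) {

     private var intervals: Vector[Double] = Vector.empty // heartbeat inter-arrival times (ms)
     private var lastHeartbeat: Option[Long] = None

     // Record a heartbeat arrival at `timestamp` (milliseconds).
     def heartbeat(timestamp: Long): Unit = {
       lastHeartbeat.foreach { last =>
         intervals = (intervals :+ (timestamp - last).toDouble).takeRight(1000)
       }
       lastHeartbeat = Some(timestamp)
     }

     // Suspicion level: the longer since the last heartbeat, the higher phi.
     def phi(now: Long): Double = lastHeartbeat match {
       case None => 0.0
       case Some(last) =>
         val mean     = if (intervals.isEmpty) 1000.0 else intervals.sum / intervals.size
         val variance = if (intervals.isEmpty) 0.0
                        else intervals.map(i => (i - mean) * (i - mean)).sum / intervals.size
         val stdDev   = math.max(math.sqrt(variance), mean / 10.0)
         // phi = -log10(probability that a heartbeat arrives later than now),
         // assuming normally distributed inter-arrival times.
         val pLater = 1.0 - cdf(((now - last).toDouble - mean) / stdDev)
         -math.log10(math.max(pLater, 1e-10))
     }

     def isAvailable(now: Long): Boolean = phi(now) < threshold

     // Logistic approximation of the standard normal CDF.
     private def cdf(x: Double): Double =
       1.0 / (1.0 + math.exp(-x * (1.5976 + 0.070566 * x * x)))
   }
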
State diagram for the member states
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. image:: images/member-states.png
Member states
^^^^^^^^^^^^^
- **joining**
  transient state when joining a cluster

- **up**
  normal operating state

- **leaving** / **exiting**
  states during graceful removal

- **removed**
  tombstone state (no longer a member)

- **down**
  marked as down/offline/unreachable
User actions
^^^^^^^^^^^^
- **join**
  join a single node to a cluster - can be explicit or automatic on startup if
  a list of seed nodes has been specified in the configuration

- **leave**
  tell a node to leave the cluster gracefully

- **down**
  mark a node as temporarily down

- **remove**
  remove a node from the cluster immediately
Leader actions
^^^^^^^^^^^^^^
- shifting members in and out of the cluster

  - joining -> up
  - exiting -> removed

- partition distribution

  - scheduling handoffs (pending changes)
  - setting the partition table (partition path -> base node)
Partitioning
============
Each partition (an actor or actor subtree) in the actor system is assigned to a
base node. The mapping from partition path (actor address) to base node is
stored in the partition table and is maintained as part of the cluster state
through the gossip protocol. The partition table is only updated by the leader
node. If the partition has a configured instance count (N value) greater than
one, then the location of the other instances can be found deterministically by
counting from the base node. The first instance will be found on the base node,
and the other instances on the next N-1 nodes, given the nodes in sorted order.
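
A sketch of locating the instances by counting from the base node, assuming the nodes are kept in sorted order and that counting wraps around the end of the sorted list (the wrap-around, the types, and all names are assumptions for illustration):

.. code-block:: scala

   final case class Node(host: String, port: Int)

   object PartitionPlacement {

     implicit val nodeOrdering: Ordering[Node] = Ordering.by(n => (n.host, n.port))

     // The first instance lives on the base node, the remaining N-1 instances
     // on the next nodes in sorted order (wrapping around if necessary).
     def instanceNodes(nodes: Set[Node], baseNode: Node, instanceCount: Int): Seq[Node] = {
       val sorted = nodes.toVector.sorted
       val baseIndex = sorted.indexOf(baseNode)
       require(baseIndex >= 0, "base node must be a member of the cluster")
       (0 until math.min(instanceCount, sorted.size)).map(i => sorted((baseIndex + i) % sorted.size))
     }
   }
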
TODO: discuss how different N values within the tree work (especially subtrees
with a greater or lesser N value). A simple implementation would only allow the
highest-up-the-tree, non-singular (greater than one) value to be used for any
subtree.
When rebalancing is required the leader will schedule handoffs, gossiping a set
of pending changes, and when each change is complete the leader will update the
partition table.
TODO: look further into how actors will be distributed and also avoiding
unnecessary migrations just to create a more balanced cluster.
Handoff
-------
Handoff for an actor-based system is different than for a data-based system. The
most important point is that message ordering (from a given node to a given
actor) may need to be maintained. If an actor is a singleton actor then the
cluster may also need to assure that there is only one such actor active at any
one time. Both of these situations can be handled by forwarding and buffering
messages during transitions.
A *graceful handoff* (one where the previous host node is up and running during
the handoff), given a previous host node ``N1``, a new host node ``N2``, and an
actor partition ``A`` to be migrated from ``N1`` to ``N2``, has this general
structure:
1. the leader sets a pending change for ``N1`` to handoff ``A`` to ``N2``
2. ``N1`` notices the pending change and sends an initialization message to ``N2``
3. in response ``N2`` creates ``A`` and sends back a ready message
4. after receiving the ready message ``N1`` marks the change as complete
5. the leader sees the migration is complete and updates the partition table
6. all nodes eventually see the new partitioning and use ``N2``
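
For illustration, the coordination messages exchanged in this sequence could be modelled as plain case classes; all of the names below are assumptions, not an actual wire protocol:

.. code-block:: scala

   final case class Node(host: String, port: Int)
   final case class PartitionPath(path: String) // actor address of the partition

   // Sketch of the handoff coordination messages (illustrative names).
   sealed trait HandoffMessage

   // Gossiped by the leader: N1 should hand off partition A to N2 (step 1).
   final case class PendingChange(partition: PartitionPath, from: Node, to: Node) extends HandoffMessage

   // Sent by the previous host N1 to the new host N2 (step 2).
   final case class InitiateHandoff(partition: PartitionPath, from: Node) extends HandoffMessage

   // Sent back by N2 once the actor has been created (step 3).
   final case class HandoffReady(partition: PartitionPath, newHost: Node) extends HandoffMessage

   // Marked by N1 and picked up by the leader to update the partition table (steps 4-5).
   final case class HandoffComplete(partition: PartitionPath) extends HandoffMessage
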
Transitions
^^^^^^^^^^^
There are transition times in the handoff process where different approaches can
be used to give different guarantees.
Migration transition
~~~~~~~~~~~~~~~~~~~~
The first transition starts when ``N1`` initiates the moving of ``A`` and ends
when ``N1`` receives the ready message, and is referred to as the *migration
transition*.
The first question is: during the migration transition should ``N1`` continue to
process messages for ``A``? Or is it important that no messages for ``A`` are
processed on ``N1`` once migration begins?
If it is okay for the previous host node to process messages during migration
then there is nothing that needs to be done at this point.
If no messages are to be processed on the previous host node during migration
then there are two possibilities: the messages are forwarded to the new host and
buffered until the actor is ready, or the messages are simply dropped by
terminating the actor and allowing the normal dead letter process to be used.
Update transition
~~~~~~~~~~~~~~~~~
The second transition begins when the migration is marked as complete and ends
when all nodes have the updated partition table (when all nodes will use ``N2``
as the host for ``A``), and is referred to as the *update transition*.
Once the update transition begins ``N1`` can forward any messages it receives
for ``A`` to the new host ``N2``. The question is whether or not message
ordering needs to be preserved. If messages sent to the previous host node
``N1`` are being forwarded, then it is possible that a message sent to ``N1``
could be forwarded after a direct message to the new host ``N2``, breaking
message ordering from a client to actor ``A``.
In this situation ``N2`` can keep a buffer for messages per sending node. Each
buffer is flushed and removed when an acknowledgement has been received. When
each node in the cluster sees the partition update it first sends an ack message
to the previous host node ``N1`` before beginning to use ``N2`` as the new host
for ``A``. Any messages sent from the client node directly to ``N2`` will be
buffered. ``N1`` can count down the number of acks to determine when no more
forwarding is needed. The ack message from any node will always follow any other
messages sent to ``N1``. When ``N1`` receives the ack message it also forwards
it to ``N2`` and again this ack message will follow any other messages already
forwarded for ``A``. When ``N2`` receives an ack message the buffer for the
sending node can be flushed and removed. Any subsequent messages from this
sending node can be queued normally. Once all nodes in the cluster have
acknowledged the partition change and ``N2`` has cleared all buffers, the
handoff is complete and message ordering has been preserved. In practice the
buffers should remain small as it is only those messages sent directly to ``N2``
before the acknowledgement has been forwarded that will be buffered.
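
A sketch of this bookkeeping, with the per-sender buffers held by the new host ``N2`` and the ack countdown held by the previous host ``N1`` (types and method names are assumptions for illustration):

.. code-block:: scala

   final case class Node(host: String, port: Int)

   // Held by the new host N2 during the update transition: messages arriving
   // directly from a sending node are buffered until that node's ack (forwarded
   // via N1, after any messages N1 had already forwarded) has been received.
   final class HandoffBuffers {
     private var buffers: Map[Node, Vector[Any]] = Map.empty
     private var acked: Set[Node] = Set.empty

     // A message sent directly to N2 by `sender`.
     def receiveDirect(sender: Node, message: Any, deliver: Any => Unit): Unit =
       if (acked.contains(sender)) deliver(message)
       else buffers = buffers.updated(sender, buffers.getOrElse(sender, Vector.empty) :+ message)

     // The ack from `sender` arrived via N1: flush and remove its buffer.
     def ackReceived(sender: Node, deliver: Any => Unit): Unit = {
       buffers.getOrElse(sender, Vector.empty).foreach(deliver)
       buffers = buffers - sender
       acked = acked + sender
     }
   }

   // Held by the previous host N1: count down acks to know when forwarding
   // messages for A to N2 is no longer needed.
   final class ForwardingState(expectedAcks: Int) {
     private var remaining = expectedAcks
     def ackReceived(): Unit = remaining = math.max(remaining - 1, 0)
     def forwardingNeeded: Boolean = remaining > 0
   }
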
Graceful handoff
^^^^^^^^^^^^^^^^
A more complete process for graceful handoff would be:
1. the leader sets a pending change for ``N1`` to handoff ``A`` to ``N2``
2. ``N1`` notices the pending change and sends an initialization message to
``N2``. Options:
a. keep ``A`` on ``N1`` active and continue processing messages as normal
b. ``N1`` forwards all messages for ``A`` to ``N2``
c. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
becoming dead letters)
3. in response ``N2`` creates ``A`` and sends back a ready message. Options:
a. ``N2`` simply processes messages for ``A`` as normal
b. ``N2`` creates a buffer per sending node for ``A``. Each buffer is
opened (flushed and removed) when an acknowledgement for the sending
node has been received (via ``N1``)
4. after receiving the ready message ``N1`` marks the change as complete. Options:
a. ``N1`` forwards all messages for ``A`` to ``N2`` during the update transition
b. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
becoming dead letters)
5. the leader sees the migration is complete and updates the partition table
6. all nodes eventually see the new partitioning and use ``N2``
i. each node sends an acknowledgement message to ``N1``
ii. when ``N1`` receives the acknowledgement it can count down the pending
acknowledgements and remove forwarding when complete
iii. when ``N2`` receives the acknowledgement it can open the buffer for the
sending node (if buffers are used)
The default approach is to take options 2a, 3a, and 4a - allowing ``A`` on
``N1`` to continue processing messages during migration and then forwarding any
messages during the update transition. This assumes stateless actors that do not
have a dependency on message ordering from any given source.
If an actor has a distributed durable mailbox then nothing needs to be done,
other than migrating the actor.
If message ordering needs to be maintained during the update transition then
option 3b can be used, creating buffers per sending node.
If the actors are robust to message send failures then the message-dropping
approach can be used (with no forwarding or buffering needed).
If an actor is a singleton (only one instance possible throughout the cluster)
and state is transferred during the migration initialization, then options 2b and
3b would be required.
Terms
=====
**node**
  A logical member of a cluster. There could be multiple nodes on a physical
  machine.

**cluster**
  A set of nodes. Contains distributed Akka applications.

**partition**
  An actor (possibly a subtree of actors) in the Akka application that is
  distributed within the cluster.

**partition path**
  Also referred to as the actor address.

**base node**
  The first node (with nodes in sorted order) that contains a partition.

**partition table**
  A mapping from partition path to base node.

**instance count**
  The number of instances of a partition in the cluster. Also referred to as
  the N value of the partition.