Merge pull request #17881 from akka/wip-16800-niy-patriknw

=clu #16800 Remove NIY sections of Cluster Specification
Patrik Nordwall 2015-07-02 20:00:01 +02:00
commit 89046f7f18


@@ -5,15 +5,6 @@
######################
.. note:: This document describes the design concepts of the clustering.
It is divided into two parts, where the first part describes what is
currently implemented and the second part describes what is planned as
future enhancements/additions. References to unimplemented parts have
been marked with the footnote :ref:`[*] <niy>`
The Current Cluster
*******************
Intro
=====
@@ -34,9 +25,8 @@ Terms
A set of nodes joined together through the `membership`_ service.
**leader**
A single node in the cluster that acts as the leader. Managing cluster convergence,
partitions :ref:`[*] <niy>`, fail-over :ref:`[*] <niy>`, rebalancing :ref:`[*] <niy>`
etc.
A single node in the cluster that acts as the leader. Managing cluster convergence
and membership state transitions.
Membership
@@ -44,8 +34,8 @@ Membership
A cluster is made up of a set of member nodes. The identifier for each node is a
``hostname:port:uid`` tuple. An Akka application can be distributed over a cluster with
each node hosting some part of the application. Cluster membership and partitioning
:ref:`[*] <niy>` of the application are decoupled. A node could be a member of a
each node hosting some part of the application. Cluster membership and the actors running
on that node of the application are decoupled. A node could be a member of a
cluster without hosting any actors. Joining a cluster is initiated
by issuing a ``Join`` command to one of the nodes in the cluster to join.
@@ -349,304 +339,3 @@ Failure Detection and Unreachability
again and thereby remove the flag
Future Cluster Enhancements and Additions
*****************************************
Goal
====
In addition to membership, also provide automatic partitioning :ref:`[*] <niy>`,
handoff :ref:`[*] <niy>`, and cluster rebalancing :ref:`[*] <niy>` of actors.
Additional Terms
================
These additional terms are used in this section.
**partition** :ref:`[*] <niy>`
An actor or subtree of actors in the Akka application that is distributed
within the cluster.
**partition point** :ref:`[*] <niy>`
The actor at the head of a partition. The point around which a partition is
formed.
**partition path** :ref:`[*] <niy>`
Also referred to as the actor address. Has the format `actor1/actor2/actor3`
**instance count** :ref:`[*] <niy>`
The number of instances of a partition in the cluster. Also referred to as the
``N-value`` of the partition.
**instance node** :ref:`[*] <niy>`
A node that an actor instance is assigned to.
**partition table** :ref:`[*] <niy>`
A mapping from partition path to a set of instance nodes (where the nodes are
referred to by their ordinal position in the sorted order of nodes).
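As a rough illustration only (these names are not part of the Akka API), the
terms could be captured as simple types::

  // Illustrative type aliases for the terms above; not an Akka API.
  object PartitionTerms {
    type PartitionPath  = String                                // e.g. "actor1/actor2/actor3"
    type NodeOrdinal    = Int                                   // position of the node in sorted order
    type InstanceCount  = Int                                   // the N-value of a partition
    type PartitionTable = Map[PartitionPath, Seq[NodeOrdinal]]  // partition path -> instance nodes
  }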
Partitioning :ref:`[*] <niy>`
=============================
.. note:: Actor partitioning is not implemented yet.
Each partition (an actor or actor subtree) in the actor system is assigned to a
set of nodes in the cluster. The actor at the head of the partition is referred
to as the partition point. The mapping from partition path (actor address of the
format "a/b/c") to instance nodes is stored in the partition table and is
maintained as part of the cluster state through the gossip protocol. The
partition table is only updated by the ``leader`` node. Currently the only possible
partition points are *routed* actors.
Routed actors can have an instance count greater than one. The instance count is
also referred to as the ``N-value``. If the ``N-value`` is greater than one then
a set of instance nodes will be given in the partition table.
Note that in the first implementation there may be a restriction such that only
top-level partitions are possible (the highest possible partition points are
used and sub-partitioning is not allowed). Still to be explored in more detail.
The cluster ``leader`` determines the current instance count for a partition based
on two axes: fault-tolerance and scaling.
Fault-tolerance determines a minimum number of instances for a routed actor
(allowing N-1 nodes to crash while still maintaining at least one running actor
instance). The user can specify a function from the current number of nodes to the
number of acceptable node failures: ``n: Int => f: Int`` where ``f < n``.
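As a hedged sketch (the names are illustrative, not an actual API), such a
function and the minimum instance count derived from it could look like::

  // Sketch only: the user supplies f(n), the acceptable number of node
  // failures for n nodes, with f < n. To survive f crashes and still keep
  // one running instance, the minimum instance count is f + 1.
  val acceptableFailures: Int => Int = n => (n - 1) / 2  // example: a minority of nodes may fail

  def minimumInstances(currentNodes: Int): Int = {
    val f = acceptableFailures(currentNodes)
    require(f < currentNodes, "f must be less than n")
    f + 1
  }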
Scaling reflects the number of instances needed to maintain good throughput and
is influenced by metrics from the system, particularly a history of mailbox
size, CPU load, and GC percentages. It may also be possible to accept scaling
hints from the user that indicate expected load.
The balancing of partitions can be determined in a very simple way in the first
implementation, where the overlap of partitions is minimized. Partitions are
spread over the cluster ring in a circular fashion, with each instance node in
the first available space. For example, given a cluster with ten nodes and three
partitions, A, B, and C, having N-values of 4, 3, and 5; partition A would have
instances on nodes 1-4; partition B would have instances on nodes 5-7; partition
C would have instances on nodes 8-10 and 1-2. The only overlap is on nodes 1 and
2.
The distribution of partitions is not limited, however, to having instances on
adjacent nodes in the sorted ring order. Each instance can be assigned to any
node and the more advanced load balancing algorithms will make use of this. The
partition table contains a mapping from path to instance nodes. The partitioning
for the above example would be::
A -> { 1, 2, 3, 4 }
B -> { 5, 6, 7 }
C -> { 8, 9, 10, 1, 2 }
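A minimal sketch of this naive circular placement (illustrative only, not the
actual implementation) that reproduces the table above::

  // Sketch: instances are laid out around the sorted ring, each partition
  // starting where the previous one ended and wrapping around at the end.
  def assignCircular(nodes: Vector[Int],
                     nValues: Seq[(String, Int)]): Map[String, Vector[Int]] = {
    var offset = 0
    nValues.map { case (partition, n) =>
      val instances = Vector.tabulate(n)(i => nodes((offset + i) % nodes.size))
      offset += n
      partition -> instances
    }.toMap
  }

  // assignCircular((1 to 10).toVector, Seq("A" -> 4, "B" -> 3, "C" -> 5))
  // yields A -> {1, 2, 3, 4}, B -> {5, 6, 7}, C -> {8, 9, 10, 1, 2}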
If 5 new nodes join the cluster and in sorted order these nodes appear after the
current nodes 2, 4, 5, 7, and 8, then the partition table could be updated to
the following, with all instances on the same physical nodes as before::
A -> { 1, 2, 4, 5 }
B -> { 7, 9, 10 }
C -> { 12, 14, 15, 1, 2 }
When rebalancing is required the ``leader`` will schedule handoffs, gossiping a set
of pending changes, and when each change is complete the ``leader`` will update the
partition table.
Additional Leader Responsibilities
----------------------------------
After moving a member from ``joining`` to ``up``, the leader can start assigning partitions
:ref:`[*] <niy>` to the new node, and when a node is ``leaving`` the ``leader`` will
reassign partitions :ref:`[*] <niy>` across the cluster (it is possible for a leaving
node to itself be the ``leader``). When all partition handoff :ref:`[*] <niy>` has
completed then the node will change to the ``exiting`` state.
On convergence the leader can schedule rebalancing across the cluster,
but it may also be possible for the user to explicitly rebalance the
cluster by specifying migrations :ref:`[*] <niy>`, or to rebalance :ref:`[*] <niy>`
the cluster automatically based on metrics from member nodes. Metrics may be spread
using the gossip protocol or possibly more efficiently using a *random chord* method,
where the ``leader`` contacts several random nodes around the cluster ring and each
contacted node gathers information from their immediate neighbours, giving a random
sampling of load information.
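One possible shape of such a sampling step, sketched with hypothetical names
(``neighbours`` and ``metricsOf`` are assumptions, not Akka APIs)::

  // Sketch: the leader contacts a few random nodes on the ring; each
  // contacted node reports metrics for itself and its immediate neighbours.
  final case class NodeMetrics(node: Int, mailboxSize: Long, cpuLoad: Double)

  def sampleLoad(ring: Vector[Int],
                 neighbours: Int => Seq[Int],
                 metricsOf: Int => NodeMetrics,
                 contacts: Int): Seq[NodeMetrics] = {
    val contacted = scala.util.Random.shuffle(ring).take(contacts)
    contacted.flatMap(n => (n +: neighbours(n)).map(metricsOf)).distinct
  }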
Handoff
-------
Handoff for an actor-based system is different from that for a data-based system. The
most important point is that message ordering (from a given node to a given
actor instance) may need to be maintained. If an actor is a singleton actor
(only one instance possible throughout the cluster) then the cluster may also
need to assure that there is only one such actor active at any one time. Both of
these situations can be handled by forwarding and buffering messages during
transitions.
A *graceful handoff* (one where the previous host node is up and running during
the handoff), given a previous host node ``N1``, a new host node ``N2``, and an
actor partition ``A`` to be migrated from ``N1`` to ``N2``, has this general
structure:
1. the ``leader`` sets a pending change for ``N1`` to handoff ``A`` to ``N2``
2. ``N1`` notices the pending change and sends an initialization message to ``N2``
3. in response ``N2`` creates ``A`` and sends back a ready message
4. after receiving the ready message ``N1`` marks the change as
complete and shuts down ``A``
5. the ``leader`` sees the migration is complete and updates the partition table
6. all nodes eventually see the new partitioning and use ``N2``
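These steps could be carried by a small set of messages, sketched here with
hypothetical names (only ``akka.actor.Address`` is an existing type; the
message classes are illustrative)::

  import akka.actor.Address

  // Hypothetical handoff protocol messages, numbered as in the steps above.
  final case class PendingChange(partition: String, from: Address, to: Address) // 1. gossiped by the leader
  final case class HandoffInit(partition: String, from: Address)                // 2. N1 -> N2
  final case class HandoffReady(partition: String)                              // 3. N2 -> N1, after creating A
  final case class HandoffComplete(partition: String, to: Address)              // 4. N1 marks the change complete
  // 5./6. the leader updates the partition table and the new mapping spreads
  //       through gossip until every node routes messages for A to N2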
Transitions
^^^^^^^^^^^
There are transition times in the handoff process where different approaches can
be used to give different guarantees.
Migration Transition
~~~~~~~~~~~~~~~~~~~~
The first transition starts when ``N1`` initiates the moving of ``A`` and ends
when ``N1`` receives the ready message, and is referred to as the *migration
transition*.
The first question is: during the migration transition, should
- ``N1`` continue to process messages for ``A``?
- Or is it important that no messages for ``A`` are processed on
``N1`` once migration begins?
If it is okay for the previous host node ``N1`` to process messages during
migration then there is nothing that needs to be done at this point.
If no messages are to be processed on the previous host node during migration
then there are two possibilities: the messages are forwarded to the new host and
buffered until the actor is ready, or the messages are simply dropped by
terminating the actor and allowing the normal dead letter process to be used.
Update Transition
~~~~~~~~~~~~~~~~~
The second transition begins when the migration is marked as complete and ends
when all nodes have the updated partition table (when all nodes will use ``N2``
as the host for ``A``, i.e. we have convergence) and is referred to as the
*update transition*.
Once the update transition begins ``N1`` can forward any messages it receives
for ``A`` to the new host ``N2``. The question is whether or not message
ordering needs to be preserved. If messages sent to the previous host node
``N1`` are being forwarded, then it is possible that a message sent to ``N1``
could be forwarded after a direct message to the new host ``N2``, breaking
message ordering from a client to actor ``A``.
In this situation ``N2`` can keep a buffer for messages per sending node. Each
buffer is flushed and removed when an acknowledgement (``ack``) message has been
received. When each node in the cluster sees the partition update it first sends
an ``ack`` message to the previous host node ``N1`` before beginning to use
``N2`` as the new host for ``A``. Any messages sent from the client node
directly to ``N2`` will be buffered. ``N1`` can count down the number of acks to
determine when no more forwarding is needed. The ``ack`` message from any node
will always follow any other messages sent to ``N1``. When ``N1`` receives the
``ack`` message it also forwards it to ``N2`` and again this ``ack`` message
will follow any other messages already forwarded for ``A``. When ``N2`` receives
an ``ack`` message, the buffer for the sending node can be flushed and removed.
Any subsequent messages from this sending node can be queued normally. Once all
nodes in the cluster have acknowledged the partition change and ``N2`` has
cleared all buffers, the handoff is complete and message ordering has been
preserved. In practice the buffers should remain small as it is only those
messages sent directly to ``N2`` before the acknowledgement has been forwarded
that will be buffered.
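A hedged sketch of this per-sender buffering on the new host ``N2``
(illustrative only; messages forwarded by ``N1`` do not pass through these
buffers)::

  import akka.actor.Address
  import scala.collection.immutable.Queue

  // Sketch: messages arriving at N2 directly from a sending node are held
  // until that node's ack (relayed via N1 after all forwarded messages) has
  // been seen; flushing the buffer at that point preserves per-sender ordering.
  final class HandoffBuffers {
    private var buffers = Map.empty[Address, Queue[Any]]
    private var acked   = Set.empty[Address]

    /** A direct message from `sender`: returns what may be delivered now. */
    def direct(sender: Address, msg: Any): Seq[Any] =
      if (acked(sender)) Seq(msg)
      else {
        buffers = buffers.updated(sender, buffers.getOrElse(sender, Queue.empty) :+ msg)
        Seq.empty
      }

    /** The ack for `sender` arrives (forwarded by N1): flush its buffer. */
    def ack(sender: Address): Seq[Any] = {
      acked += sender
      val flushed = buffers.getOrElse(sender, Queue.empty)
      buffers -= sender
      flushed
    }
  }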
Graceful Handoff
^^^^^^^^^^^^^^^^
A more complete process for graceful handoff would be:
1. the ``leader`` sets a pending change for ``N1`` to handoff ``A`` to ``N2``
2. ``N1`` notices the pending change and sends an initialization message to
``N2``. Options:
a. keep ``A`` on ``N1`` active and continuing processing messages as normal
b. ``N1`` forwards all messages for ``A`` to ``N2``
c. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
becoming dead letters)
3. in response ``N2`` creates ``A`` and sends back a ready message. Options:
a. ``N2`` simply processes messages for ``A`` as normal
b. ``N2`` creates a buffer per sending node for ``A``. Each buffer is
opened (flushed and removed) when an acknowledgement for the sending
node has been received (via ``N1``)
4. after receiving the ready message ``N1`` marks the change as complete. Options:
a. ``N1`` forwards all messages for ``A`` to ``N2`` during the update transition
b. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
becoming dead letters)
5. the ``leader`` sees the migration is complete and updates the partition table
6. all nodes eventually see the new partitioning and use ``N2``
i. each node sends an acknowledgement message to ``N1``
ii. when ``N1`` receives the acknowledgement it can count down the pending
acknowledgements and remove forwarding when complete
iii. when ``N2`` receives the acknowledgement it can open the buffer for the
sending node (if buffers are used)
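One hedged way to make this option space explicit, using illustrative names
only::

  // Illustrative encoding of the handoff options listed above.
  sealed trait MigrationOption
  case object KeepProcessing    extends MigrationOption  // 2a: A stays active on N1
  case object ForwardToNewHost  extends MigrationOption  // 2b / 4a: N1 forwards messages for A to N2
  case object DropAsDeadLetters extends MigrationOption  // 2c / 4b: terminate A, messages become dead letters

  sealed trait ReadyOption
  case object ProcessImmediately extends ReadyOption     // 3a: no ordering guarantee needed
  case object BufferPerSender    extends ReadyOption     // 3b: preserve per-sender ordering

  // The default combination described below corresponds to options 2a, 3a and 4a:
  // (KeepProcessing, ProcessImmediately, ForwardToNewHost).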
The default approach is to take options 2a, 3a, and 4a - allowing ``A`` on
``N1`` to continue processing messages during migration and then forwarding any
messages during the update transition. This assumes stateless actors that do not
have a dependency on message ordering from any given source.
- If an actor has persistent (durable) state then nothing needs to be done,
other than migrating the actor.
- If message ordering needs to be maintained during the update transition then
option 3b can be used, creating buffers per sending node.
- If the actors are robust to message send failures then the dropping messages
approach can be used (with no forwarding or buffering needed).
- If an actor is a singleton (only one instance possible throughout the cluster)
and state is transferred during the migration initialization, then options 2b
and 3b would be required.
Stateful Actor Replication :ref:`[*] <niy>`
===========================================
.. note:: Stateful actor replication is not implemented yet.
.. _niy:
[*] Not Implemented Yet
=======================
* Actor partitioning
* Actor handoff
* Actor rebalancing
* Stateful actor replication