Merge pull request #17881 from akka/wip-16800-niy-patriknw

=clu #16800 Remove NIY sections of Cluster Specification

This commit is contained in:

commit 89046f7f18

1 changed file with 4 additions and 315 deletions
@@ -5,15 +5,6 @@
 ######################
 
 .. note:: This document describes the design concepts of the clustering.
-   It is divided into two parts, where the first part describes what is
-   currently implemented and the second part describes what is planned as
-   future enhancements/additions. References to unimplemented parts have
-   been marked with the footnote :ref:`[*] <niy>`
-
-
-The Current Cluster
-*******************
-
 
 Intro
 =====
@@ -34,9 +25,8 @@ Terms
   A set of nodes joined together through the `membership`_ service.
 
 **leader**
-  A single node in the cluster that acts as the leader. Managing cluster convergence,
-  partitions :ref:`[*] <niy>`, fail-over :ref:`[*] <niy>`, rebalancing :ref:`[*] <niy>`
-  etc.
+  A single node in the cluster that acts as the leader. Managing cluster convergence
+  and membership state transitions.
 
 
 Membership
@@ -44,8 +34,8 @@ Membership
 
 A cluster is made up of a set of member nodes. The identifier for each node is a
 ``hostname:port:uid`` tuple. An Akka application can be distributed over a cluster with
-each node hosting some part of the application. Cluster membership and partitioning
-:ref:`[*] <niy>` of the application are decoupled. A node could be a member of a
+each node hosting some part of the application. Cluster membership and the actors running
+on that node of the application are decoupled. A node could be a member of a
 cluster without hosting any actors. Joining a cluster is initiated
 by issuing a ``Join`` command to one of the nodes in the cluster to join.
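
As an illustration (not part of the diff itself), the ``Join`` command mentioned
above can be issued programmatically through the public Akka Cluster API. A
minimal sketch, where the system name, host, and port are placeholders::

  import akka.actor.{ ActorSystem, Address }
  import akka.cluster.Cluster

  val system = ActorSystem("ClusterSystem")

  // address of a node that is already part of (or forming) the cluster
  val seedAddress = Address("akka.tcp", "ClusterSystem", "host1", 2552)

  // ask that node to let us join; membership then spreads via gossip
  Cluster(system).join(seedAddress)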
@@ -349,304 +339,3 @@ Failure Detection and Unreachability
   again and thereby remove the flag


Future Cluster Enhancements and Additions
*****************************************

Goal
====

In addition to membership, also provide automatic partitioning :ref:`[*] <niy>`,
handoff :ref:`[*] <niy>`, and cluster rebalancing :ref:`[*] <niy>` of actors.

Additional Terms
================

These additional terms are used in this section.

**partition** :ref:`[*] <niy>`
  An actor or subtree of actors in the Akka application that is distributed
  within the cluster.

**partition point** :ref:`[*] <niy>`
  The actor at the head of a partition. The point around which a partition is
  formed.

**partition path** :ref:`[*] <niy>`
  Also referred to as the actor address. Has the format ``actor1/actor2/actor3``.

**instance count** :ref:`[*] <niy>`
  The number of instances of a partition in the cluster. Also referred to as the
  ``N-value`` of the partition.

**instance node** :ref:`[*] <niy>`
  A node that an actor instance is assigned to.

**partition table** :ref:`[*] <niy>`
  A mapping from partition path to a set of instance nodes (where the nodes are
  referred to by their ordinal position, given the nodes in sorted order).

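As a rough sketch of how these terms fit together (hypothetical names only;
this was never part of the Akka API), the partition table could be modelled as::

  object PartitionModel {
    type PartitionPath = String  // e.g. "actor1/actor2/actor3"
    type NodeOrdinal   = Int     // position of a node in sorted ring order

    // partition path -> instance nodes; updated only by the leader
    final case class PartitionTable(
        version: Long,
        partitions: Map[PartitionPath, Vector[NodeOrdinal]]) {

      // the instance count ("N-value") of a partition
      def nValue(path: PartitionPath): Int =
        partitions.get(path).fold(0)(_.size)
    }
  }
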
Partitioning :ref:`[*] <niy>`
=============================

.. note:: Actor partitioning is not implemented yet.

Each partition (an actor or actor subtree) in the actor system is assigned to a
set of nodes in the cluster. The actor at the head of the partition is referred
to as the partition point. The mapping from partition path (actor address of the
format "a/b/c") to instance nodes is stored in the partition table and is
maintained as part of the cluster state through the gossip protocol. The
partition table is only updated by the ``leader`` node. Currently the only possible
partition points are *routed* actors.

Routed actors can have an instance count greater than one. The instance count is
also referred to as the ``N-value``. If the ``N-value`` is greater than one then
a set of instance nodes will be given in the partition table.

Note that in the first implementation there may be a restriction such that only
top-level partitions are possible (the highest possible partition points are
used and sub-partitioning is not allowed). This is still to be explored in more
detail.

The cluster ``leader`` determines the current instance count for a partition based
on two axes: fault-tolerance and scaling.

Fault-tolerance determines a minimum number of instances for a routed actor
(allowing ``N-1`` nodes to crash while still maintaining at least one running actor
instance). The user can specify a function from the current number of nodes to the
number of acceptable node failures: ``n: Int => f: Int`` where ``f < n``.
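
For example, a function tolerating the failure of roughly a third of the nodes
could look like the following sketch (the name ``acceptableFailures`` is
hypothetical)::

  // Hypothetical: acceptable node failures as a function of cluster size,
  // here tolerating about a third of the nodes, and always fewer than n.
  val acceptableFailures: Int => Int = n => math.min(n - 1, n / 3)

  // Minimum instance count so that f failures leave at least one instance.
  def minInstances(n: Int): Int = acceptableFailures(n) + 1

  // minInstances(10) == 4: three instance nodes may fail and one survives.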

Scaling reflects the number of instances needed to maintain good throughput and
is influenced by metrics from the system, particularly a history of mailbox
size, CPU load, and GC percentages. It may also be possible to accept scaling
hints from the user that indicate expected load.

The balancing of partitions can be determined in a very simple way in the first
implementation, where the overlap of partitions is minimized. Partitions are
spread over the cluster ring in a circular fashion, with each instance node placed
in the first available space. For example, given a cluster with ten nodes and three
partitions, A, B, and C, having N-values of 4, 3, and 5: partition A would have
instances on nodes 1-4; partition B would have instances on nodes 5-7; and partition
C would have instances on nodes 8-10 and 1-2. The only overlap is on nodes 1 and 2.
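
A minimal sketch of this circular placement (hypothetical helper, using 1-based
node ordinals)::

  // Hypothetical: place each partition's instances on consecutive ring
  // positions, wrapping around; nodes are numbered 1..nodeCount.
  def assign(nodeCount: Int, nValues: Seq[(String, Int)]): Map[String, Seq[Int]] = {
    var next = 0
    nValues.map { case (partition, n) =>
      val instances = (0 until n).map(i => (next + i) % nodeCount + 1)
      next += n
      partition -> instances
    }.toMap
  }

  // assign(10, Seq("A" -> 4, "B" -> 3, "C" -> 5)) gives
  //   A -> 1,2,3,4   B -> 5,6,7   C -> 8,9,10,1,2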

The distribution of partitions is not, however, limited to having instances on
adjacent nodes in the sorted ring order. Each instance can be assigned to any
node, and the more advanced load balancing algorithms will make use of this. The
partition table contains a mapping from path to instance nodes. The partitioning
for the above example would be::

  A -> { 1, 2, 3, 4 }
  B -> { 5, 6, 7 }
  C -> { 8, 9, 10, 1, 2 }

If 5 new nodes join the cluster, and in sorted order these nodes appear after the
current nodes 2, 4, 5, 7, and 8, then the partition table could be updated to
the following, with all instances remaining on the same physical nodes as before::

  A -> { 1, 2, 4, 5 }
  B -> { 7, 9, 10 }
  C -> { 12, 14, 15, 1, 2 }

When rebalancing is required the ``leader`` will schedule handoffs, gossiping a set
of pending changes, and when each change is complete the ``leader`` will update the
partition table.
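
A pending change gossiped by the ``leader`` could be represented roughly like
this (hypothetical names)::

  // Hypothetical: one scheduled handoff, gossiped as part of cluster state.
  final case class PendingHandoff(
      partitionPath: String,  // e.g. "A"
      fromNode: Int,          // current host (ordinal position on the ring)
      toNode: Int,            // new host
      complete: Boolean)      // set when the handoff has finished

  // The leader applies completed changes to the partition table and
  // removes them from the pending set.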


Additional Leader Responsibilities
----------------------------------

After moving a member from ``joining`` to ``up``, the leader can start assigning
partitions :ref:`[*] <niy>` to the new node, and when a node is ``leaving`` the
``leader`` will reassign partitions :ref:`[*] <niy>` across the cluster (it is
possible for a leaving node to itself be the ``leader``). When all partition
handoff :ref:`[*] <niy>` has completed, the node will change to the ``exiting``
state.

On convergence the leader can schedule rebalancing across the cluster,
but it may also be possible for the user to explicitly rebalance the
cluster by specifying migrations :ref:`[*] <niy>`, or to rebalance :ref:`[*] <niy>`
the cluster automatically based on metrics from member nodes. Metrics may be spread
using the gossip protocol or, possibly more efficiently, using a *random chord*
method, where the ``leader`` contacts several random nodes around the cluster ring
and each contacted node gathers information from its immediate neighbours, giving
a random sampling of load information.


Handoff
-------

Handoff for an actor-based system is different from that for a data-based system.
The most important point is that message ordering (from a given node to a given
actor instance) may need to be maintained. If an actor is a singleton actor
(only one instance possible throughout the cluster) then the cluster may also
need to ensure that there is only one such actor active at any one time. Both of
these situations can be handled by forwarding and buffering messages during
transitions.

A *graceful handoff* (one where the previous host node is up and running during
the handoff), given a previous host node ``N1``, a new host node ``N2``, and an
actor partition ``A`` to be migrated from ``N1`` to ``N2``, has this general
structure:

1. the ``leader`` sets a pending change for ``N1`` to hand off ``A`` to ``N2``

2. ``N1`` notices the pending change and sends an initialization message to ``N2``

3. in response ``N2`` creates ``A`` and sends back a ready message

4. after receiving the ready message ``N1`` marks the change as
   complete and shuts down ``A``

5. the ``leader`` sees the migration is complete and updates the partition table

6. all nodes eventually see the new partitioning and use ``N2``

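The messages exchanged in steps 2-4 could be modelled roughly as follows
(hypothetical names; this protocol was never implemented)::

  // Hypothetical handoff protocol messages for the steps above.
  sealed trait HandoffMessage

  // step 2: N1 -> N2, asks the new host to create partition A
  final case class Initialize(partitionPath: String) extends HandoffMessage

  // step 3: N2 -> N1, A has been created and can receive messages
  final case class Ready(partitionPath: String) extends HandoffMessage

  // steps 4-5: N1 marks the pending change complete; the leader picks
  // this up from gossip and updates the partition table
  final case class ChangeComplete(partitionPath: String) extends HandoffMessage
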

Transitions
^^^^^^^^^^^

There are transition times in the handoff process where different approaches can
be used to give different guarantees.


Migration Transition
~~~~~~~~~~~~~~~~~~~~

The first transition starts when ``N1`` initiates the moving of ``A`` and ends
when ``N1`` receives the ready message, and is referred to as the *migration
transition*.

The first question is: during the migration transition, should

- ``N1`` continue to process messages for ``A``?

- or is it important that no messages for ``A`` are processed on
  ``N1`` once migration begins?

If it is okay for the previous host node ``N1`` to process messages during
migration then there is nothing that needs to be done at this point.

If no messages are to be processed on the previous host node during migration
then there are two possibilities: the messages are forwarded to the new host and
buffered until the actor is ready, or the messages are simply dropped by
terminating the actor and allowing the normal dead letter process to be used. The
forwarding alternative is sketched below.
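
A minimal sketch of the forwarding alternative, with a plain actor standing in
for ``N1``'s side of the migration (hypothetical; only the plain Akka actor API
is assumed)::

  import akka.actor.{ Actor, ActorRef }

  // Hypothetical: once migration begins, N1 stops processing messages
  // for A locally and forwards them to the new host, which buffers
  // them until A is ready there.
  class MigratingPartition(newHost: ActorRef) extends Actor {
    def receive = {
      // forward keeps the original sender, so replies still work
      case msg => newHost forward msg
    }
  }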


Update Transition
~~~~~~~~~~~~~~~~~

The second transition begins when the migration is marked as complete and ends
when all nodes have the updated partition table (when all nodes will use ``N2``
as the host for ``A``, i.e. we have convergence) and is referred to as the
*update transition*.

Once the update transition begins ``N1`` can forward any messages it receives
for ``A`` to the new host ``N2``. The question is whether or not message
ordering needs to be preserved. If messages sent to the previous host node
``N1`` are being forwarded, then it is possible that a message sent to ``N1``
could be forwarded after a direct message to the new host ``N2``, breaking
message ordering from a client to actor ``A``.

In this situation ``N2`` can keep a buffer of messages per sending node. Each
buffer is flushed and removed when an acknowledgement (``ack``) message has been
received from that node. When each node in the cluster sees the partition update
it first sends an ``ack`` message to the previous host node ``N1`` before
beginning to use ``N2`` as the new host for ``A``. Any messages sent from that
node directly to ``N2`` will be buffered. ``N1`` can count down the number of
acks to determine when no more forwarding is needed.

The ``ack`` message from any node will always follow any other messages it has
sent to ``N1``. When ``N1`` receives the ``ack`` message it also forwards it to
``N2``, and again this ``ack`` message will follow any other messages already
forwarded for ``A``. When ``N2`` receives an ``ack`` message, the buffer for the
sending node can be flushed and removed, and any subsequent messages from this
sending node can be queued normally. Once all nodes in the cluster have
acknowledged the partition change and ``N2`` has cleared all buffers, the handoff
is complete and message ordering has been preserved. In practice the buffers
should remain small, as it is only those messages sent directly to ``N2`` before
the acknowledgement has been forwarded that will be buffered.
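
The per-sender buffering on ``N2`` could be sketched like this (hypothetical
class, not an Akka API)::

  import akka.actor.Address

  // Hypothetical: N2's per-sending-node buffers during the update transition.
  final class SenderBuffers {
    private var buffers = Map.empty[Address, Vector[Any]]

    // buffer a message from `node` until that node's ack arrives via N1
    def buffer(node: Address, msg: Any): Unit =
      buffers = buffers.updated(node, buffers.getOrElse(node, Vector.empty) :+ msg)

    // ack received (forwarded by N1): flush and remove the node's buffer;
    // the returned messages are delivered to A in their original order
    def ack(node: Address): Vector[Any] = {
      val pending = buffers.getOrElse(node, Vector.empty)
      buffers -= node
      pending
    }

    // handoff is complete once every buffer has been flushed
    def allFlushed: Boolean = buffers.isEmpty
  }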


Graceful Handoff
^^^^^^^^^^^^^^^^

A more complete process for graceful handoff would be (the acknowledgement
counting of step 6 is sketched after this list):

1. the ``leader`` sets a pending change for ``N1`` to hand off ``A`` to ``N2``

2. ``N1`` notices the pending change and sends an initialization message to
   ``N2``. Options:

   a. keep ``A`` on ``N1`` active and continue processing messages as normal

   b. ``N1`` forwards all messages for ``A`` to ``N2``

   c. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
      becoming dead letters)

3. in response ``N2`` creates ``A`` and sends back a ready message. Options:

   a. ``N2`` simply processes messages for ``A`` as normal

   b. ``N2`` creates a buffer per sending node for ``A``. Each buffer is
      opened (flushed and removed) when an acknowledgement for the sending
      node has been received (via ``N1``)

4. after receiving the ready message ``N1`` marks the change as complete. Options:

   a. ``N1`` forwards all messages for ``A`` to ``N2`` during the update transition

   b. ``N1`` drops all messages for ``A`` (terminate ``A`` with messages
      becoming dead letters)

5. the ``leader`` sees the migration is complete and updates the partition table

6. all nodes eventually see the new partitioning and use ``N2``

   i. each node sends an acknowledgement message to ``N1``

   ii. when ``N1`` receives the acknowledgement it can count down the pending
       acknowledgements and remove forwarding when complete

   iii. when ``N2`` receives the acknowledgement it can open the buffer for the
        sending node (if buffers are used)

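A small sketch of the acknowledgement countdown in step 6.ii (hypothetical)::

  // Hypothetical: N1 counts down acks, one per node in the cluster,
  // and stops forwarding to N2 once every node has acknowledged.
  final class AckCountdown(nodesInCluster: Int) {
    private var remaining = nodesInCluster
    def ackReceived(): Unit = if (remaining > 0) remaining -= 1
    def forwardingNeeded: Boolean = remaining > 0
  }
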
The default approach is to take options 2a, 3a, and 4a - allowing ``A`` on
``N1`` to continue processing messages during migration and then forwarding any
messages during the update transition. This assumes stateless actors that do not
have a dependency on message ordering from any given source.

- If an actor has persistent (durable) state then nothing needs to be done,
  other than migrating the actor.

- If message ordering needs to be maintained during the update transition then
  option 3b can be used, creating buffers per sending node.

- If the actors are robust to message send failures then the message-dropping
  approach can be used (with no forwarding or buffering needed).

- If an actor is a singleton (only one instance possible throughout the cluster)
  and state is transferred during the migration initialization, then options 2b
  and 3b would be required.


Stateful Actor Replication :ref:`[*] <niy>`
===========================================

.. note:: Stateful actor replication is not implemented yet.

.. _niy:

[*] Not Implemented Yet
=======================

* Actor partitioning
* Actor handoff
* Actor rebalancing
* Stateful actor replication