Merge pull request #20809 from 2m/wip-#20808-restart-node-2m
#20808 clarify docs on the quarantined node restart
commit e39255cef0
2 changed files with 18 additions and 18 deletions
@@ -147,7 +147,7 @@ status to ``down`` automatically after the configured time of unreachability.
 
 This is a naïve approach to remove unreachable nodes from the cluster membership. It
 works great for crashes and short transient network partitions, but not for long network
 partitions. Both sides of the network partition will see the other side as unreachable
 and after a while remove it from its cluster membership. Since this happens on both
 sides the result is that two separate disconnected clusters have been created. This
 can also happen because of long GC pauses or system overload.
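The automatic downing discussed in this hunk is driven by a single classic Akka Cluster setting. A minimal configuration sketch (the 10-second timeout is an illustrative value, not a recommendation):

```hocon
# application.conf -- enables the auto-down behavior described above.
# The docs below warn against using this in production.
akka.cluster.auto-down-unreachable-after = 10s
```

Setting it to `off` (the default) disables automatic downing entirely.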
|
@@ -155,14 +155,14 @@ can also happen because of long GC pauses or system overload.
 .. warning::

   We recommend against using the auto-down feature of Akka Cluster in production.
   This is crucial for correct behavior if you use :ref:`cluster-singleton-java` or
   :ref:`cluster_sharding_java`, especially together with Akka :ref:`persistence-java`.

   A pre-packaged solution for the downing problem is provided by
   `Split Brain Resolver <http://doc.akka.io/docs/akka/rp-16s01p03/java/split-brain-resolver.html>`_,
   which is part of the Lightbend Reactive Platform. If you don’t use RP, you should anyway carefully
   read the `documentation <http://doc.akka.io/docs/akka/rp-16s01p03/java/split-brain-resolver.html>`_
   of the Split Brain Resolver and make sure that the solution you are using handles the concerns
   described there.

 .. note:: If you have *auto-down* enabled and the failure detector triggers, you

@@ -427,8 +427,8 @@ If system messages cannot be delivered to a node it will be quarantined and then
 cannot come back from ``unreachable``. This can happen if the there are too many
 unacknowledged system messages (e.g. watch, Terminated, remote actor deployment,
 failures of actors supervised by remote parent). Then the node needs to be moved
-to the ``down`` or ``removed`` states and the actor system must be restarted before
-it can join the cluster again.
+to the ``down`` or ``removed`` states and the actor system of the quarantined node
+must be restarted before it can join the cluster again.

 The nodes in the cluster monitor each other by sending heartbeats to detect if a node is
 unreachable from the rest of the cluster. The heartbeat arrival times is interpreted
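The "interpretation" of heartbeat arrival times mentioned in the context above is Akka's phi accrual failure detector. Below is a self-contained sketch of the phi calculation; the logistic approximation and its constants mirror the ones used in Akka's `PhiAccrualFailureDetector`, but this is an illustration under a simple normal-distribution assumption, not the library code:

```java
public class PhiSketch {
    // phi = -log10(probability that a heartbeat arrives later than timeDiffMs),
    // assuming heartbeat inter-arrival times are roughly normally distributed
    // with the given mean and standard deviation.
    static double phi(double timeDiffMs, double meanMs, double stdDevMs) {
        double y = (timeDiffMs - meanMs) / stdDevMs;
        // Logistic approximation of the normal CDF tail.
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        if (timeDiffMs > meanMs) {
            return -Math.log10(e / (1.0 + e));
        } else {
            return -Math.log10(1.0 - 1.0 / (1.0 + e));
        }
    }

    public static void main(String[] args) {
        // Heartbeat exactly on time: low suspicion (phi ~0.3).
        System.out.println(phi(1000, 1000, 100));
        // Heartbeat ten standard deviations late: phi far above the
        // default unreachability threshold of 8.
        System.out.println(phi(2000, 1000, 100));
    }
}
```

A node is marked ``unreachable`` once phi crosses the configured threshold, and phi keeps growing the longer heartbeats stay absent, so suspicion accrues instead of flipping on a single missed heartbeat.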

@@ -142,7 +142,7 @@ status to ``down`` automatically after the configured time of unreachability.
 
 This is a naïve approach to remove unreachable nodes from the cluster membership. It
 works great for crashes and short transient network partitions, but not for long network
 partitions. Both sides of the network partition will see the other side as unreachable
 and after a while remove it from its cluster membership. Since this happens on both
 sides the result is that two separate disconnected clusters have been created. This
 can also happen because of long GC pauses or system overload.
@@ -150,14 +150,14 @@ can also happen because of long GC pauses or system overload.
 .. warning::

   We recommend against using the auto-down feature of Akka Cluster in production.
   This is crucial for correct behavior if you use :ref:`cluster-singleton-scala` or
   :ref:`cluster_sharding_scala`, especially together with Akka :ref:`persistence-scala`.

   A pre-packaged solution for the downing problem is provided by
   `Split Brain Resolver <http://doc.akka.io/docs/akka/rp-16s01p03/scala/split-brain-resolver.html>`_,
   which is part of the Lightbend Reactive Platform. If you don’t use RP, you should anyway carefully
   read the `documentation <http://doc.akka.io/docs/akka/rp-16s01p03/scala/split-brain-resolver.html>`_
   of the Split Brain Resolver and make sure that the solution you are using handles the concerns
   described there.

 .. note:: If you have *auto-down* enabled and the failure detector triggers, you

@@ -422,8 +422,8 @@ If system messages cannot be delivered to a node it will be quarantined and then
 cannot come back from ``unreachable``. This can happen if the there are too many
 unacknowledged system messages (e.g. watch, Terminated, remote actor deployment,
 failures of actors supervised by remote parent). Then the node needs to be moved
-to the ``down`` or ``removed`` states and the actor system must be restarted before
-it can join the cluster again.
+to the ``down`` or ``removed`` states and the actor system of the quarantined node
+must be restarted before it can join the cluster again.

 The nodes in the cluster monitor each other by sending heartbeats to detect if a node is
 unreachable from the rest of the cluster. The heartbeat arrival times is interpreted
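The heartbeat monitoring in the context above is tuned through the cluster failure detector settings. A configuration sketch; the key names are the real ``akka.cluster`` settings, and the values shown are what I believe to be the defaults in this Akka generation (verify against your version's reference.conf):

```hocon
# application.conf -- knobs for the heartbeat-based failure detector.
akka.cluster.failure-detector {
  heartbeat-interval = 1 s           # how often heartbeats are sent
  threshold = 8.0                    # phi value above which a node is marked unreachable
  acceptable-heartbeat-pause = 3 s   # pause tolerated before phi grows quickly
}
```

Raising ``threshold`` or ``acceptable-heartbeat-pause`` makes the detector more tolerant of GC pauses and transient network hiccups, at the cost of detecting real crashes more slowly.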