* moved to cluster tests, in new package akka.cluster.testkit
* changed config in tests
* migration guide
* documentation clarifications for Downing and Leaving
* update warnings in Singleton and Sharding
parent 064f06f5a6
commit a217d5566e
61 changed files with 414 additions and 309 deletions

@@ -227,14 +227,6 @@ graceful leaving process of a cluster member.

See @ref:[removal of Internal Cluster Sharding Data](typed/cluster-sharding.md#removal-of-internal-cluster-sharding-data) in the documentation of the new APIs.

## Configuration

`ClusterShardingSettings` is a parameter to the `start` method of
the `ClusterSharding` extension, i.e. each entity type can be configured with different settings
if needed.

See @ref:[configuration](typed/cluster-sharding.md#configuration) for more information.
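
As a rough illustration of the classic API described above, the sketch below passes an explicitly constructed `ClusterShardingSettings` to `start`. It is a minimal sketch, not one of the documented samples; the `Counter` actor and `EntityEnvelope` message are hypothetical placeholders and the system is assumed to be configured with `akka.actor.provider = cluster`.

```scala
import akka.actor.{ Actor, ActorSystem, Props }
import akka.cluster.sharding.{ ClusterSharding, ClusterShardingSettings, ShardRegion }

// Hypothetical entity and message envelope, used only for this sketch.
final case class EntityEnvelope(id: Long, payload: Any)

class Counter extends Actor {
  private var count = 0
  def receive: Receive = { case _ => count += 1 }
}

object ShardingSettingsExample extends App {
  // Assumes akka.actor.provider = cluster in the configuration.
  val system = ActorSystem("ClusterSystem")

  val extractEntityId: ShardRegion.ExtractEntityId = {
    case msg @ EntityEnvelope(id, _) => (id.toString, msg)
  }
  val extractShardId: ShardRegion.ExtractShardId = {
    case EntityEnvelope(id, _) => (id % 10).toString
  }

  // Each entity type gets its own settings instance, so different types
  // can be tuned independently if needed.
  val settings = ClusterShardingSettings(system)

  val counterRegion = ClusterSharding(system).start(
    typeName = "Counter",
    entityProps = Props[Counter](),
    settings = settings,
    extractEntityId = extractEntityId,
    extractShardId = extractShardId)
}
```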

## Inspecting cluster sharding state

Two requests to inspect the cluster state are available:
@@ -256,20 +248,13 @@ directly sending messages to the individual entities.

## Lease

A @ref[lease](coordination.md) can be used as an additional safety measure to ensure a shard
does not run on two nodes.
A lease can be used as an additional safety measure to ensure a shard does not run on two nodes.
See @ref:[Lease](typed/cluster-sharding.md#lease) in the documentation of the new APIs.

Reasons for how this can happen:
## Configuration

* Network partitions without an appropriate downing provider
* Mistakes in the deployment process leading to two separate Akka Clusters
* Timing issues between removing members from the Cluster on one side of a network partition and shutting them down on the other side
`ClusterShardingSettings` is a parameter to the `start` method of
the `ClusterSharding` extension, i.e. each entity type can be configured with different settings
if needed.

A lease can be a final backup that means that each shard won't create child entity actors unless it has the lease.

To use a lease for sharding, set `akka.cluster.sharding.use-lease` to the configuration location
of the lease to use. Each shard will try to acquire a lease with the name `<actor system name>-shard-<type name>-<shard id>` and
the owner is set to the `Cluster(system).selfAddress.hostPort`.

If a shard can't acquire a lease it will remain uninitialized so messages for entities it owns will
be buffered in the `ShardRegion`. If the lease is lost after initialization the Shard will be terminated.
See @ref:[configuration](typed/cluster-sharding.md#configuration) for more information.
@@ -104,6 +104,14 @@ Scala

Java
: @@snip [SimpleClusterListener2.java](/akka-docs/src/test/java/jdocs/cluster/SimpleClusterListener2.java) { #join }

## Leaving

See @ref:[Leaving](typed/cluster.md#leaving) in the documentation of the new APIs.

## Downing

See @ref:[Downing](typed/cluster.md#downing) in the documentation of the new APIs.

<a id="cluster-subscriber"></a>
## Subscribe to Cluster Events
@@ -3,7 +3,7 @@

## Commercial Support

Commercial support is provided by [Lightbend](http://www.lightbend.com).
Akka is part of the [Lightbend Reactive Platform](http://www.lightbend.com/platform).
Akka is part of the [Lightbend Platform](http://www.lightbend.com/platform).

## Sponsors
@@ -11,6 +11,40 @@ is [no longer available as a static method](https://github.com/scala/bug/issues/

If you are still using Scala 2.11 then you must upgrade to 2.12 or 2.13.

## Auto-downing removed

Auto-downing of unreachable Cluster members has been removed after warnings and recommendations against using it
for many years. It was disabled by default, but could be enabled with the configuration
`akka.cluster.auto-down-unreachable-after`.

For alternatives see the @ref:[documentation about Downing](../typed/cluster.md#downing).

Auto-downing was a naïve approach to remove unreachable nodes from the cluster membership.
In a production environment it will eventually break down the cluster.
When a network partition occurs, both sides of the partition will see the other side as unreachable
and remove it from the cluster. This results in the formation of two separate, disconnected, clusters
(known as *Split Brain*).

This behavior is not limited to network partitions. It can also occur if a node in the cluster is
overloaded, or experiences a long GC pause.

When using @ref:[Cluster Singleton](../typed/cluster-singleton.md) or @ref:[Cluster Sharding](../typed/cluster-sharding.md)
it can break the contract provided by those features. Both provide a guarantee that an actor will be unique in a cluster.
With the auto-down feature enabled, it is possible for multiple independent clusters to form (*Split Brain*).
When this happens the guaranteed uniqueness will no longer be true resulting in undesirable behavior in the system.

This is even more severe when @ref:[Akka Persistence](../typed/persistence.md) is used in conjunction with
Cluster Sharding. In this case, the lack of unique actors can cause multiple actors to write to the same journal.
Akka Persistence operates on a single writer principle. Having multiple writers will corrupt the journal
and make it unusable.

Finally, even if you don't use features such as Persistence, Sharding, or Singletons, auto-downing can lead the
system to form multiple small clusters. These small clusters will be independent from each other. They will be
unable to communicate and as a result you may experience performance degradation. Once this condition occurs,
it will require manual intervention in order to reform the cluster.

Because of these issues, auto-downing should **never** be used in a production environment.

## Removed features that were deprecated

After being deprecated since 2.5.0, the following have been removed in Akka 2.6.
@@ -94,13 +128,25 @@ to make remote interactions look like local method calls.

Warnings about `TypedActor` have been [mentioned in documentation](https://doc.akka.io/docs/akka/2.5/typed-actors.html#when-to-use-typed-actors)
for many years.

### akka-protobuf

`akka-protobuf` was never intended to be used by end users but perhaps this was not well-documented.
Applications should use the standard Protobuf dependency instead of `akka-protobuf`. The artifact is still
published, but the transitive dependency to `akka-protobuf` has been removed.

Akka is now using Protobuf version 3.9.0 for serialization of messages defined by Akka.

### Cluster Client

Cluster client has been deprecated as of 2.6 in favor of [Akka gRPC](https://doc.akka.io/docs/akka-grpc/current/index.html).
It is not advised to build new applications with Cluster client, and existing users @ref[should migrate to Akka gRPC](../cluster-client.md#migration-to-akka-grpc).

### akka.Main

`akka.Main` is deprecated in favour of starting the `ActorSystem` from a custom main class instead. `akka.Main` was not
adding much value and typically a custom main class is needed anyway.

## Remoting

### Default remoting is now Artery TCP
@@ -184,20 +230,7 @@ For TCP:

Classic remoting is deprecated but can be used in `2.6`. Explicitly disable Artery by setting the property `akka.remote.artery.enabled` to `false`. Further, any configuration under `akka.remote` that is
specific to classic remoting needs to be moved to `akka.remote.classic`. To see which configuration options
are specific to classic, search for them in: [`akka-remote/reference.conf`](/akka-remote/src/main/resources/reference.conf)

### akka-protobuf

`akka-protobuf` was never intended to be used by end users but perhaps this was not well-documented.
Applications should use the standard Protobuf dependency instead of `akka-protobuf`. The artifact is still
published, but the transitive dependency to `akka-protobuf` has been removed.

Akka is now using Protobuf version 3.9.0 for serialization of messages defined by Akka.

### Cluster Client

Cluster client has been deprecated as of 2.6 in favor of [Akka gRPC](https://doc.akka.io/docs/akka-grpc/current/index.html).
It is not advised to build new applications with Cluster client, and existing users @ref[should migrate to Akka gRPC](../cluster-client.md#migration-to-akka-grpc).
are specific to classic, search for them in: @ref:[`akka-remote/reference.conf`](../general/configuration.md#config-akka-remote).
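
To make that concrete, here is a minimal sketch of such a configuration; the provider setting, hostname, and port are placeholder values, and the now-optional Netty dependency still needs to be on the classpath for classic remoting.

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// A minimal sketch for staying on classic remoting in 2.6: Artery is disabled
// explicitly and the classic settings live under `akka.remote.classic`.
object ClassicRemotingExample extends App {
  val config = ConfigFactory.parseString("""
    akka.actor.provider = remote
    akka.remote.artery.enabled = false
    akka.remote.classic.netty.tcp {
      hostname = "127.0.0.1"  // placeholder values
      port = 2552
    }
    """).withFallback(ConfigFactory.load())

  val system = ActorSystem("ClassicRemotingExample", config)
}
```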

## Java Serialization
@@ -235,14 +268,12 @@ handling that type and it was previously "accidentally" serialized with Java ser

The following documents configuration changes and behavior changes where no action is required. In some cases the old
behavior can be restored via configuration.

### Remoting

#### Remoting dependencies have been made optional
### Remoting dependencies have been made optional

Classic remoting depends on Netty and Artery UDP depends on Aeron. These are now both optional dependencies that need
to be explicitly added. See @ref[classic remoting](../remoting.md) or @ref[artery remoting](../remoting-artery.md) for instructions.

#### Remote watch and deployment have been disabled without Cluster use
### Remote watch and deployment have been disabled without Cluster use

By default, these remoting features are disabled when not using Akka Cluster:
@@ -43,10 +43,10 @@ if that feature is enabled.

@@@ warning

**Don't use Cluster Sharding together with Automatic Downing**,
since it allows the cluster to split up into two separate clusters, which in turn will result
in *multiple shards and entities* being started, one in each separate cluster!
See @ref:[Downing](cluster.md#automatic-vs-manual-downing).
Make sure to not use a Cluster downing strategy that may split the cluster into several separate clusters in
case of network problems or system overload (long GC pauses), since that will result in *multiple shards and entities*
being started, one in each separate cluster!
See @ref:[Downing](cluster.md#downing).

@@@
@@ -304,6 +304,26 @@ rebalanced to other nodes.

See @ref:[How To Startup when Cluster Size Reached](cluster.md#how-to-startup-when-a-cluster-size-is-reached)
for more information about `min-nr-of-members`.

## Lease

A @ref[lease](../coordination.md) can be used as an additional safety measure to ensure a shard
does not run on two nodes.

Reasons for how this can happen:

* Network partitions without an appropriate downing provider
* Mistakes in the deployment process leading to two separate Akka Clusters
* Timing issues between removing members from the Cluster on one side of a network partition and shutting them down on the other side

A lease can be a final backup that means that each shard won't create child entity actors unless it has the lease.

To use a lease for sharding, set `akka.cluster.sharding.use-lease` to the configuration location
of the lease to use. Each shard will try to acquire a lease with the name `<actor system name>-shard-<type name>-<shard id>` and
the owner is set to the `Cluster(system).selfAddress.hostPort`.

If a shard can't acquire a lease it will remain uninitialized so messages for entities it owns will
be buffered in the `ShardRegion`. If the lease is lost after initialization the Shard will be terminated.
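
As a sketch of how that might be wired up (the `docs.example.test-lease` path below is a placeholder for the config location of whatever lease implementation is actually used, for example one from Akka Coordination):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// A minimal sketch: point `use-lease` at the config location of a lease
// implementation. "docs.example.test-lease" is a placeholder, not a real
// lease shipped with Akka.
object ShardingLeaseExample extends App {
  val config = ConfigFactory.parseString("""
    akka.cluster.sharding.use-lease = "docs.example.test-lease"
    """).withFallback(ConfigFactory.load())

  val system = ActorSystem("ClusterSystem", config)
  // Each shard then tries to acquire a lease named
  // "<actor system name>-shard-<type name>-<shard id>" before starting entities.
}
```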

## Removal of internal Cluster Sharding data

Removal of internal Cluster Sharding data is only relevant for "Persistent Mode".
@@ -326,15 +346,6 @@ cannot startup because of corrupt data, which may happen if accidentally

two clusters were running at the same time, e.g. caused by using auto-down
and there was a network partition.

@@@ warning

**Don't use Cluster Sharding together with Automatic Downing**,
since it allows the cluster to split up into two separate clusters, which in turn will result
in *multiple shards and entities* being started, one in each separate cluster!
See @ref:[Downing](cluster.md#automatic-vs-manual-downing).

@@@

Use this program as a standalone Java main program:

```

@@ -347,7 +358,7 @@ The program is included in the `akka-cluster-sharding` jar file. It

is easiest to run it with the same classpath and configuration as your ordinary
application. It can be run from sbt or Maven in a similar way.

Specify the entity type names (same as you use in the `start` method
Specify the entity type names (same as you use in the `init` method
of `ClusterSharding`) as program arguments.

If you specify `-2.3` as the first program argument it will also try
@@ -32,6 +32,15 @@ such as single-point of bottleneck. Single-point of failure is also a relevant c

but for some cases this feature takes care of that by making sure that another singleton
instance will eventually be started.

@@@ warning

Make sure to not use a Cluster downing strategy that may split the cluster into several separate clusters in
case of network problems or system overload (long GC pauses), since that will result in *multiple Singletons*
being started, one in each separate cluster!
See @ref:[Downing](cluster.md#downing).

@@@

### Singleton manager

The cluster singleton pattern manages one singleton actor instance among all cluster nodes or a group of nodes tagged with
@@ -80,23 +89,20 @@ The singleton instance will not run on members with status @ref:[WeaklyUp](clust

This pattern may seem to be very tempting to use at first, but it has several drawbacks, some of which are listed below:

* the cluster singleton may quickly become a *performance bottleneck*,
* you can not rely on the cluster singleton to be *non-stop* available — e.g. when the node on which the singleton has
been running dies, it will take a few seconds for this to be noticed and the singleton be migrated to another node,
* in the case of a *network partition* appearing in a Cluster that is using Automatic Downing (see docs for
@ref:[Auto Downing](cluster.md#auto-downing-do-not-use)),
it may happen that the isolated clusters each decide to spin up their own singleton, meaning that there might be multiple
singletons running in the system, yet the Clusters have no way of finding out about them (because of the partition).

Especially the last point is something you should be aware of — in general when using the Cluster Singleton pattern
you should take care of downing nodes yourself and not rely on the timing based auto-down feature.
* The cluster singleton may quickly become a *performance bottleneck*.
* You can not rely on the cluster singleton to be *non-stop* available — e.g. when the node on which the singleton
has been running dies, it will take a few seconds for this to be noticed and the singleton be migrated to another node.
* If many singletons are used, be aware that all of them will run on the oldest node (or the oldest with the configured role).
@ref:[Cluster Sharding](cluster-sharding.md) combined with keeping the "singleton" entities alive can be a better
alternative.

@@@ warning

**Don't use Cluster Singleton together with Automatic Downing**,
since it allows the cluster to split up into two separate clusters, which in turn will result
in *multiple Singletons* being started, one in each separate cluster!

Make sure to not use a Cluster downing strategy that may split the cluster into several separate clusters in
case of network problems or system overload (long GC pauses), since that will result in *multiple Singletons*
being started, one in each separate cluster!
See @ref:[Downing](cluster.md#downing).

@@@

## Example
@@ -255,95 +255,69 @@ after the restart, when it come up as new incarnation of existing member in the

trying to join in, then the existing one will be removed from the cluster and then it will
be allowed to join.

<a id="automatic-vs-manual-downing"></a>
### Downing

When a member is considered by the failure detector to be `unreachable` the
leader is not allowed to perform its duties, such as changing status of
new joining members to 'Up'. The node must first become `reachable` again, or the
status of the unreachable member must be changed to 'Down'. Changing status to 'Down'
can be performed automatically or manually. By default it must be done manually, using
@ref:[JMX](../additional/operations.md#jmx) or @ref:[HTTP](../additional/operations.md#http).

It can also be performed programmatically with @scala[`Cluster(system).down(address)`]@java[`Cluster.get(system).down(address)`].

If a node is still running and sees itself as Down it will shut down. @ref:[Coordinated Shutdown](../actors.md#coordinated-shutdown) will automatically
run if `run-coordinated-shutdown-when-down` is set to `on` (the default) however the node will not try
to leave the cluster gracefully so sharding and singleton migration will not occur.

A production solution for the downing problem is provided by
[Split Brain Resolver](http://developer.lightbend.com/docs/akka-commercial-addons/current/split-brain-resolver.html),
which is part of the [Lightbend Reactive Platform](http://www.lightbend.com/platform).
If you don’t use RP, you should anyway carefully read the [documentation](http://developer.lightbend.com/docs/akka-commercial-addons/current/split-brain-resolver.html)
of the Split Brain Resolver and make sure that the solution you are using handles the concerns
described there.

### Auto-downing - DO NOT USE

There is an automatic downing feature that you should not use in production. For testing you can enable it with configuration:

```
akka.cluster.auto-down-unreachable-after = 120s
```

This means that the cluster leader member will change the `unreachable` node
status to `down` automatically after the configured time of unreachability.

This is a naïve approach to remove unreachable nodes from the cluster membership.
It can be useful during development but in a production environment it will eventually break down the cluster.
When a network partition occurs, both sides of the partition will see the other side as unreachable and remove it from the cluster.
This results in the formation of two separate, disconnected, clusters (known as *Split Brain*).

This behaviour is not limited to network partitions. It can also occur if a node
in the cluster is overloaded, or experiences a long GC pause.

@@@ warning

We recommend against using the auto-down feature of Akka Cluster in production. It
has multiple undesirable consequences for production systems.

If you are using @ref:[Cluster Singleton](cluster-singleton.md) or @ref:[Cluster Sharding](cluster-sharding.md) it can break the contract provided by
those features. Both provide a guarantee that an actor will be unique in a cluster.
With the auto-down feature enabled, it is possible for multiple independent clusters
to form (*Split Brain*). When this happens the guaranteed uniqueness will no
longer be true resulting in undesirable behaviour in the system.

This is even more severe when @ref:[Akka Persistence](persistence.md) is used in
conjunction with Cluster Sharding. In this case, the lack of unique actors can
cause multiple actors to write to the same journal. Akka Persistence operates on a
single writer principle. Having multiple writers will corrupt the journal
and make it unusable.

Finally, even if you don't use features such as Persistence, Sharding, or Singletons,
auto-downing can lead the system to form multiple small clusters. These small
clusters will be independent from each other. They will be unable to communicate
and as a result you may experience performance degradation. Once this condition
occurs, it will require manual intervention in order to reform the cluster.

Because of these issues, auto-downing should **never** be used in a production environment.

@@@

### Leaving

There are two ways to remove a member from the cluster.
There are a few ways to remove a member from the cluster.

1. The recommended way to leave a cluster is a graceful exit, informing the cluster that a node shall leave.
This can be performed using @ref:[JMX](../additional/operations.md#jmx) or @ref:[HTTP](../additional/operations.md#http).
This method will offer faster hand off to peer nodes during node shutdown.
1. When a graceful exit is not possible, you can stop the actor system (or the JVM process, for example a SIGTERM sent from the environment). It will be detected
as unreachable and removed after the automatic or manual downing.
1. The recommended way to leave a cluster is a graceful exit, informing the cluster that a node shall leave.
This is performed by @ref:[Coordinated Shutdown](../actors.md#coordinated-shutdown) when the `ActorSystem`
is terminated and also when a SIGTERM is sent from the environment to stop the JVM process.
1. Graceful exit can also be performed using @ref:[HTTP](../additional/operations.md#http) or @ref:[JMX](../additional/operations.md#jmx).
1. When a graceful exit is not possible, for example in case of abrupt termination of the JVM process, the node
will be detected as unreachable by other nodes and removed after @ref:[Downing](#downing).

The @ref:[Coordinated Shutdown](../actors.md#coordinated-shutdown) will automatically run when the cluster node sees itself as
Graceful leaving will offer faster hand off to peer nodes during node shutdown than abrupt termination and downing.

The @ref:[Coordinated Shutdown](../actors.md#coordinated-shutdown) will also run when the cluster node sees itself as
`Exiting`, i.e. leaving from another node will trigger the shutdown process on the leaving node.
Tasks for graceful leaving of the cluster including graceful shutdown of Cluster Singletons and
Cluster Sharding are added automatically when Akka Cluster is used, i.e. running the shutdown
process will also trigger the graceful leaving if it's not already in progress.
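
As a small illustration, leaving can also be triggered programmatically with the typed Cluster API. This is a minimal sketch, assuming a node already configured with `akka.actor.provider = cluster`; terminating the `ActorSystem` or sending SIGTERM achieves the same graceful process via Coordinated Shutdown.

```scala
import akka.actor.typed.ActorSystem
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.typed.{ Cluster, Leave }

// A minimal sketch: ask the local member to leave the cluster gracefully.
// Coordinated Shutdown then runs the leaving tasks, including hand-over of
// Cluster Singletons and Cluster Sharding, before the node is removed.
object LeavingExample extends App {
  val system = ActorSystem(Behaviors.empty[Any], "ClusterSystem")
  val cluster = Cluster(system)

  cluster.manager ! Leave(cluster.selfMember.address)
}
```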

Normally this is handled automatically, but in case of network failures during this process it might still
be necessary to set the node’s status to `Down` in order to complete the removal. For handling network failures
see [Split Brain Resolver](http://developer.lightbend.com/docs/akka-commercial-addons/current/split-brain-resolver.html),
part of the [Lightbend Reactive Platform](http://www.lightbend.com/platform).
be necessary to set the node’s status to `Down` in order to complete the removal, see @ref:[Downing](#downing).

### Downing

In many cases a member can gracefully exit from the cluster as described in @ref:[Leaving](#leaving), but
there are scenarios when an explicit downing decision is needed before it can be removed. For example in case
of abrupt termination of the JVM process, system overload that doesn't recover, or network partitions
that don't heal. In such cases the node(s) will be detected as unreachable by other nodes, but they must also
be marked as `Down` before they are removed.

When a member is considered by the failure detector to be `unreachable` the
leader is not allowed to perform its duties, such as changing status of
new joining members to 'Up'. The node must first become `reachable` again, or the
status of the unreachable member must be changed to `Down`. Changing status to `Down`
can be performed automatically or manually.

By default, downing must be performed manually using @ref:[HTTP](../additional/operations.md#http) or @ref:[JMX](../additional/operations.md#jmx).

Note that @ref:[Cluster Singleton](cluster-singleton.md) or @ref:[Cluster Sharding entities](cluster-sharding.md) that
are running on a crashed (unreachable) node will not be started on another node until the previous node has
been removed from the Cluster. Removal of crashed (unreachable) nodes is performed after a downing decision.

A production solution for downing is provided by
[Split Brain Resolver](https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html),
which is part of the [Lightbend Platform](http://www.lightbend.com/platform).
If you don’t have a Lightbend Platform Subscription, you should still carefully read the
[documentation](https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html)
of the Split Brain Resolver and make sure that the solution you are using handles the concerns and scenarios
described there.

A custom downing strategy can be implemented with a @apidoc[akka.cluster.DowningProvider] and enabled with
configuration `akka.cluster.downing-provider-class`.
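
A rough sketch of that hook is shown below. It only observes unreachable members and deliberately leaves the actual downing decision as a comment, since a real strategy must handle the split-brain scenarios described above; the class and actor names are placeholders.

```scala
import akka.actor.{ Actor, ActorLogging, ActorSystem, Props }
import akka.cluster.{ Cluster, DowningProvider }
import akka.cluster.ClusterEvent.{ CurrentClusterState, UnreachableMember }
import scala.concurrent.duration._

// Placeholder actor: it only logs unreachability. A real implementation would
// apply a safe strategy (for example quorum based) before calling
// Cluster(context.system).down(address).
class LoggingDowningActor extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)
  override def preStart(): Unit = cluster.subscribe(self, classOf[UnreachableMember])
  override def postStop(): Unit = cluster.unsubscribe(self)
  def receive: Receive = {
    case _: CurrentClusterState => // initial state snapshot, ignored here
    case UnreachableMember(member) =>
      log.warning("Member {} is unreachable, a downing decision is needed", member.address)
  }
}

// Enabled with: akka.cluster.downing-provider-class = "<fully qualified class name>"
class ExampleDowningProvider(system: ActorSystem) extends DowningProvider {
  override def downRemovalMargin: FiniteDuration = 10.seconds
  override def downingActorProps: Option[Props] = Some(Props[LoggingDowningActor]())
}
```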

Downing can also be performed programmatically with @scala[`Cluster(system).manager ! Down(address)`]@java[`Cluster.get(system).manager().tell(Down(address))`],
but that is mostly useful from tests and when implementing a `DowningProvider`.

If a crashed node is restarted with the same hostname and port and joins the cluster again, the previous incarnation
of that member will be downed and removed. The new join attempt with the same hostname and port is used as evidence
that the previous one is not alive anymore.

If a node is still running and sees itself as `Down` it will shut down. @ref:[Coordinated Shutdown](../actors.md#coordinated-shutdown) will automatically
run if `run-coordinated-shutdown-when-down` is set to `on` (the default) however the node will not try
to leave the cluster gracefully.
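
For example (a minimal sketch, mainly relevant for tests, assuming the setting lives under `akka.cluster`), the automatic shutdown of a node that sees itself as `Down` can be switched off:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// By default a node that sees itself as Down runs Coordinated Shutdown and
// terminates. Turning the flag off (useful mainly in tests) keeps the
// ActorSystem running even after the member has been downed.
object DownedNodeShutdownExample extends App {
  val config = ConfigFactory
    .parseString("akka.cluster.run-coordinated-shutdown-when-down = off")
    .withFallback(ConfigFactory.load())

  val system = ActorSystem("ClusterSystem", config)
}
```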

## Node Roles