Add Lightbend's SBR to Akka Cluster, #29085 (#29099)

* change package name to akka.cluster.sbr
* reference.conf has same config paths
* akka.cluster.sbr.SplitBrainResolverProvider instead of com.lightbend.akka.sbr.SplitBrainResolverProvider
* dependency from akka-cluster to akka-coordination, for lease strategy
* move TestLease to akka-coordination and use that in SBR tests
* remove keep-referee strategy
* use keep-majority by default
* review and adjust reference documentation

Co-authored-by: Johan Andrén <johan@markatta.com>
Co-authored-by: Johannes Rudolph <johannes.rudolph@gmail.com>
Co-authored-by: Christopher Batey <christopher.batey@gmail.com>
Co-authored-by: Arnout Engelen <github@bzzt.net>
This commit is contained in:
Patrik Nordwall 2020-05-25 12:21:13 +02:00 committed by GitHub
parent e0586e546c
commit c45e6ef39b
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
37 changed files with 5612 additions and 67 deletions

View file

@ -67,7 +67,7 @@ Environments such as Kubernetes send a SIGTERM, however if the JVM is wrapped wi
In case of network failures it may still be necessary to set the node's status to Down in order to complete the removal.
@ref:[Cluster Downing](../typed/cluster.md#downing) details downing nodes and downing providers.
[Split Brain Resolver](https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html) can be used to ensure
@ref:[Split Brain Resolver](../split-brain-resolver.md) can be used to ensure
the cluster continues to function during network partitions and node failures. For example
if there is an unreachability problem Split Brain Resolver would make a decision based on the configured downing strategy.

View file

@ -26,6 +26,7 @@ An Akka Persistence journal and snapshot store backed by Couchbase.
* [Akka Cluster Bootstrap](https://doc.akka.io/docs/akka-management/current/bootstrap/) helps bootstrapping an Akka cluster using Akka Discovery.
* [Akka Management Cluster HTTP](https://doc.akka.io/docs/akka-management/current/cluster-http-management.html) provides HTTP endpoints for introspecting and managing Akka clusters.
* [Akka Discovery for Kubernetes, Consul, Marathon, and AWS](https://doc.akka.io/docs/akka-management/current/discovery/)
* [Kubernetes Lease](https://doc.akka.io/docs/akka-management/current/kubernetes-lease.html)
## [Akka gRPC](https://doc.akka.io/docs/akka-grpc/current/)
@ -33,8 +34,6 @@ Akka gRPC provides support for building streaming gRPC servers and clients on to
## Akka Resilience Enhancements
* [Akka Split Brain Resolver](https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html)
* [Kubernetes Lease](https://doc.akka.io/docs/akka-enhancements/current/kubernetes-lease.html)
* [Akka Thread Starvation Detector](https://doc.akka.io/docs/akka-enhancements/current/starvation-detector.html)
* [Akka Configuration Checker](https://doc.akka.io/docs/akka-enhancements/current/config-checker.html)
* [Akka Diagnostics Recorder](https://doc.akka.io/docs/akka-enhancements/current/diagnostics-recorder.html)
@ -44,7 +43,6 @@ Akka gRPC provides support for building streaming gRPC servers and clients on to
* [Akka Multi-DC Persistence](https://doc.akka.io/docs/akka-enhancements/current/persistence-dc/index.html)
* [Akka GDPR for Persistence](https://doc.akka.io/docs/akka-enhancements/current/gdpr/index.html)
## Community Projects
Akka has a vibrant and passionate user community, the members of which have created many independent projects using Akka as well as extensions to it. See [Community Projects](https://akka.io/community/).

View file

@ -35,10 +35,10 @@ Any lease implementation should provide the following guarantees:
To acquire a lease:
Scala
: @@snip [LeaseDocSpec.scala](/akka-coordination/src/test/scala/docs/akka/coordination/LeaseDocSpec.scala) { #lease-usage }
: @@snip [LeaseDocSpec.scala](/akka-docs/src/test/scala/docs/coordination/LeaseDocSpec.scala) { #lease-usage }
Java
: @@snip [LeaseDocTest.java](/akka-coordination/src/test/java/jdocs/akka/coordination/lease/LeaseDocTest.java) { #lease-usage }
: @@snip [LeaseDocTest.java](/akka-docs/src/test/java/jdocs/coordination/LeaseDocTest.java) { #lease-usage }
Acquiring a lease returns a @scala[Future]@java[CompletionStage] as lease implementations typically are implemented
via a third party system such as the Kubernetes API server or Zookeeper.
@ -53,10 +53,10 @@ It is important to pick a lease name that will be unique for your use case. If a
in a Cluster the cluster host port can be use:
Scala
: @@snip [LeaseDocSpec.scala](/akka-coordination/src/test/scala/docs/akka/coordination/LeaseDocSpec.scala) { #cluster-owner }
: @@snip [LeaseDocSpec.scala](/akka-docs/src/test/scala/docs/coordination/LeaseDocSpec.scala) { #cluster-owner }
Java
: @@snip [LeaseDocTest.scala](/akka-coordination/src/test/java/jdocs/akka/coordination/lease/LeaseDocTest.java) { #cluster-owner }
: @@snip [LeaseDocTest.scala](/akka-docs/src/test/java/jdocs/coordination/LeaseDocTest.java) { #cluster-owner }
For use cases where multiple different leases on the same node then something unique must be added to the name. For example
a lease can be used with Cluster Sharding and in this case the shard Id is included in the lease name for each shard.
@ -77,7 +77,7 @@ Leases can be used for @ref[Cluster Singletons](cluster-singleton.md#lease) and
## Lease implementations
* [Kubernetes API](https://doc.akka.io/docs/akka-enhancements/current/kubernetes-lease.html)
* [Kubernetes API](https://doc.akka.io/docs/akka-management/current/kubernetes-lease.html)
## Implementing a lease
@ -85,10 +85,10 @@ Implementations should extend
the @scala[`akka.coordination.lease.scaladsl.Lease`]@java[`akka.coordination.lease.javadsl.Lease`]
Scala
: @@snip [LeaseDocSpec.scala](/akka-coordination/src/test/scala/docs/akka/coordination/LeaseDocSpec.scala) { #lease-example }
: @@snip [LeaseDocSpec.scala](/akka-docs/src/test/scala/docs/coordination/LeaseDocSpec.scala) { #lease-example }
Java
: @@snip [LeaseDocTest.scala](/akka-coordination/src/test/java/jdocs/akka/coordination/lease/LeaseDocTest.java) { #lease-example }
: @@snip [LeaseDocTest.java](/akka-docs/src/test/java/jdocs/coordination/LeaseDocTest.java) { #lease-example }
The methods should provide the following guarantees:
@ -109,10 +109,10 @@ The lease implementation should have support for the following properties where
This configuration location is passed into `getLease`.
Scala
: @@snip [LeaseDocSpec.scala](/akka-coordination/src/test/scala/docs/akka/coordination/LeaseDocSpec.scala) { #lease-config }
: @@snip [LeaseDocSpec.scala](/akka-docs/src/test/scala/docs/coordination/LeaseDocSpec.scala) { #lease-config }
Java
: @@snip [LeaseDocSpec.scala](/akka-coordination/src/test/scala/docs/akka/coordination/LeaseDocSpec.scala) { #lease-config }
: @@snip [LeaseDocSpec.scala](/akka-docs/src/test/scala/docs/coordination/LeaseDocSpec.scala) { #lease-config }

View file

@ -0,0 +1,481 @@
# Split Brain Resolver
When operating an Akka cluster you must consider how to handle
[network partitions](http://en.wikipedia.org/wiki/Network_partition) (a.k.a. split brain scenarios)
and machine crashes (including JVM and hardware failures). This is crucial for correct behavior if
you use @ref:[Cluster Singleton](typed/cluster-singleton.md) or @ref:[Cluster Sharding](typed/cluster-sharding.md),
especially together with Akka Persistence.
## Module info
To use Akka Split Brain Resolver is part of `akka-cluster` and you probably already have that
dependency included. Otherwise, add the following dependency in your project:
@@dependency[sbt,Maven,Gradle] {
group=com.typesafe.akka
artifact=akka-cluster_$scala.binary.version$
version=$akka.version$
}
@@project-info{ projectId="akka-cluster" }
## Enable the Split Brain Resolver
You need to enable the Split Brain Resolver by configuring it as downing provider in the configuration of
the `ActorSystem` (`application.conf`):
```
akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
```
You should also consider the different available @ref:[downing strategies](#strategies).
## The Problem
A fundamental problem in distributed systems is that network partitions (split brain scenarios) and
machine crashes are indistinguishable for the observer, i.e. a node can observe that there is a problem
with another node, but it cannot tell if it has crashed and will never be available again or if there is
a network issue that might or might not heal again after a while. Temporary and permanent failures are
indistinguishable because decisions must be made in finite time, and there always exists a temporary
failure that lasts longer than the time limit for the decision.
A third type of problem is if a process is unresponsive, e.g. because of overload, CPU starvation or
long garbage collection pauses. This is also indistinguishable from network partitions and crashes.
The only signal we have for decision is "no reply in given time for heartbeats" and this means that
phenomena causing delays or lost heartbeats are indistinguishable from each other and must be
handled in the same way.
When there is a crash, we would like to remove the affected node immediately from the cluster membership.
When there is a network partition or unresponsive process we would like to wait for a while in the hope
that it is a transient problem that will heal again, but at some point, we must give up and continue with
the nodes on one side of the partition and shut down nodes on the other side. Also, certain features are
not fully available during partitions so it might not matter that the partition is transient or not if
it just takes too long. Those two goals are in conflict with each other and there is a trade-off
between how quickly we can remove a crashed node and premature action on transient network partitions.
This is a difficult problem to solve given that the nodes on the different sides of the network partition
cannot communicate with each other. We must ensure that both sides can make this decision by themselves and
that they take the same decision about which part will keep running and which part will shut itself down.
Another type of problem that makes it difficult to see the "right" picture is when some nodes are not fully
connected and cannot communicate directly to each other but information can be disseminated between them via
other nodes.
The Akka cluster has a failure detector that will notice network partitions and machine crashes (but it
cannot distinguish the two). It uses periodic heartbeat messages to check if other nodes are available
and healthy. These observations by the failure detector are referred to as a node being *unreachable*
and it may become *reachable* again if the failure detector observes that it can communicate with it again.
The failure detector in itself is not enough for making the right decision in all situations.
The naive approach is to remove an unreachable node from the cluster membership after a timeout.
This works great for crashes and short transient network partitions, but not for long network
partitions. Both sides of the network partition will see the other side as unreachable and
after a while remove it from its cluster membership. Since this happens on both sides the result
is that two separate disconnected clusters have been created.
This approach is provided by the opt-in (off by default) auto-down feature in the OSS version of
Akka Cluster.
If you use the timeout based auto-down feature in combination with Cluster Singleton or Cluster Sharding
that would mean that two singleton instances or two sharded entities with the same identifier would be running.
One would be running: one in each cluster.
For example when used together with Akka Persistence that could result in that two instances of a
persistent actor with the same `persistenceId` are running and writing concurrently to the
same stream of persistent events, which will have fatal consequences when replaying these events.
The default setting in Akka Cluster is to not remove unreachable nodes automatically and
the recommendation is that the decision of what to
do should be taken by a human operator or an external monitoring system. This is a valid solution,
but not very convenient if you do not have this staff or external system for other reasons.
If the unreachable nodes are not downed at all they will still be part of the cluster membership.
Meaning that Cluster Singleton and Cluster Sharding will not failover to another node. While there
are unreachable nodes new nodes that are joining the cluster will not be promoted to full worthy
members (with status Up). Similarly, leaving members will not be removed until all unreachable
nodes have been resolved. In other words, keeping unreachable members for an unbounded time is
undesirable.
With that introduction of the problem domain, it is time to look at the provided strategies for
handling network partition, unresponsive nodes and crashed nodes.
## Strategies
By default the @ref:[Keep Majority](#keep-majority) strategy will be used because it works well for
most systems. However, it's wort considering the other available strategies and pick a strategy that fits
the characteristics of your system. For example, in a Kubernetes environment the @ref:[Lease](#lease) strategy
can be a good choice.
Every strategy has a failure scenario where it makes a "wrong" decision. This section describes the different
strategies and guidelines of when to use what.
When there is uncertainty it selects to down more nodes than necessary, or even downing of all nodes.
Therefore Split Brain Resolver should always be combined with a mechanism to automatically start up nodes that
have been shutdown, and join them to the existing cluster or form a new cluster again.
You enable a strategy with the configuration property `akka.cluster.split-brain-resolver.active-strategy`.
### Stable after
All strategies are inactive until the cluster membership and the information about unreachable nodes
have been stable for a certain time period. Continuously adding more nodes while there is a network
partition does not influence this timeout, since the status of those nodes will not be changed to Up
while there are unreachable nodes. Joining nodes are not counted in the logic of the strategies.
@@snip [reference.conf](/akka-cluster/src/main/resources/reference.conf) { #split-brain-resolver }
Set `akka.cluster.split-brain-resolver.stable-after` to a shorter duration to have quicker removal of crashed nodes,
at the price of risking too early action on transient network partitions that otherwise would have healed. Do not
set this to a shorter duration than the membership dissemination time in the cluster, which depends
on the cluster size. Recommended minimum duration for different cluster sizes:
|cluster size | stable-after|
|-------------|-------------|
|5 | 7 s |
|10 | 10 s|
|20 | 13 s|
|50 | 17 s|
|100 | 20 s|
|1000 | 30 s|
The different strategies may have additional settings that are described below.
@@@ note
It is important that you use the same configuration on all nodes.
@@@
The side of the split that decides to shut itself down will use the cluster *down* command
to initiate the removal of a cluster member. When that has been spread among the reachable nodes
it will be removed from the cluster membership.
It's good to terminate the `ActorSystem` and exit the JVM when the node is removed from the cluster.
That is handled by @ref:[Coordinated Shutdown](coordinated-shutdown.md)
but to exit the JVM it's recommended that you enable:
```
akka.coordinated-shutdown.exit-jvm = on
```
@@@ note
Some legacy containers may block calls to System.exit(..) and you may have to find an alternate
way to shut the app down. For example, when running Akka on top of a Spring / Tomcat setup, you
could replace the call to `System.exit(..)` with a call to Spring's ApplicationContext .close() method
(or with a HTTP call to Tomcat Manager's API to un-deploy the app).
@@@
### Keep Majority
The strategy named `keep-majority` will down the unreachable nodes if the current node is in
the majority part based on the last known membership information. Otherwise down the reachable nodes,
i.e. the own part. If the parts are of equal size the part containing the node with the lowest
address is kept.
This strategy is a good choice when the number of nodes in the cluster change dynamically and you can
therefore not use `static-quorum`.
This strategy also handles the edge case that may occur when there are membership changes at the same
time as the network partition occurs. For example, the status of two members are changed to `Up`
on one side but that information is not disseminated to the other side before the connection is broken.
Then one side sees two more nodes and both sides might consider themselves having a majority. It will
detect this situation and make the safe decision to down all nodes on the side that could be in minority
if the joining nodes were changed to `Up` on the other side. Note that this has the drawback that
if the joining nodes were not changed to `Up` and becoming a majority on the other side then each part
will shut down itself, terminating the whole cluster.
Note that if there are more than two partitions and none is in majority each part will shut down
itself, terminating the whole cluster.
If more than half of the nodes crash at the same time the other running nodes will down themselves
because they think that they are not in majority, and thereby the whole cluster is terminated.
The decision can be based on nodes with a configured `role` instead of all nodes in the cluster.
This can be useful when some types of nodes are more valuable than others. You might for example
have some nodes responsible for persistent data and some nodes with stateless worker services.
Then it probably more important to keep as many persistent data nodes as possible even though
it means shutting down more worker nodes.
Configuration:
```
akka.cluster.split-brain-resolver.active-strategy=keep-majority
```
@@snip [reference.conf](/akka-cluster/src/main/resources/reference.conf) { #keep-majority }
### Static Quorum
The strategy named `static-quorum` will down the unreachable nodes if the number of remaining
nodes are greater than or equal to a configured `quorum-size`. Otherwise, it will down the reachable nodes,
i.e. it will shut down that side of the partition. In other words, the `quorum-size` defines the minimum
number of nodes that the cluster must have to be operational.
This strategy is a good choice when you have a fixed number of nodes in the cluster, or when you can
define a fixed number of nodes with a certain role.
For example, in a 9 node cluster you will configure the `quorum-size` to 5. If there is a network split
of 4 and 5 nodes the side with 5 nodes will survive and the other 4 nodes will be downed. After that,
in the 5 node cluster, no more failures can be handled, because the remaining cluster size would be
less than 5. In the case of another failure in that 5 node cluster all nodes will be downed.
Therefore it is important that you join new nodes when old nodes have been removed.
Another consequence of this is that if there are unreachable nodes when starting up the cluster,
before reaching this limit, the cluster may shut itself down immediately. This is not an issue
if you start all nodes at approximately the same time or use the `akka.cluster.min-nr-of-members`
to define required number of members before the leader changes member status of 'Joining' members to 'Up'
You can tune the timeout after which downing decisions are made using the `stable-after` setting.
You should not add more members to the cluster than **quorum-size * 2 - 1**. A warning is logged
if this recommendation is violated. If the exceeded cluster size remains when a SBR decision is
needed it will down all nodes because otherwise there is a risk that both sides may down each
other and thereby form two separate clusters.
For rolling updates it's best to leave the cluster gracefully via
@ref:[Coordinated Shutdown](coordinated-shutdown.md) (SIGTERM).
For successful leaving SBR will not be used (no downing) but if there is an unreachability problem
at the same time as the rolling update is in progress there could be an SBR decision. To avoid that
the total number of members limit is not exceeded during the rolling update it's recommended to
leave and fully remove one node before adding a new one, when using `static-quorum`.
If the cluster is split into 3 (or more) parts each part that is smaller than then configured `quorum-size`
will down itself and possibly shutdown the whole cluster.
If more nodes than the configured `quorum-size` crash at the same time the other running nodes
will down themselves because they think that they are not in the majority, and thereby the whole
cluster is terminated.
The decision can be based on nodes with a configured `role` instead of all nodes in the cluster.
This can be useful when some types of nodes are more valuable than others. You might, for example,
have some nodes responsible for persistent data and some nodes with stateless worker services.
Then it probably more important to keep as many persistent data nodes as possible even though
it means shutting down more worker nodes.
There is another use of the `role` as well. By defining a `role` for a few (e.g. 7) stable
nodes in the cluster and using that in the configuration of `static-quorum` you will be able
to dynamically add and remove other nodes without this role and still have good decisions of what
nodes to keep running and what nodes to shut down in the case of network partitions. The advantage
of this approach compared to `keep-majority` (described below) is that you *do not* risk splitting
the cluster into two separate clusters, i.e. *a split brain**. You must still obey the rule of not
starting too many nodes with this `role` as described above. It also suffers the risk of shutting
down all nodes if there is a failure when there are not enough nodes with this `role` remaining
in the cluster, as described above.
Configuration:
```
akka.cluster.split-brain-resolver.active-strategy=static-quorum
```
@@snip [reference.conf](/akka-cluster/src/main/resources/reference.conf) { #static-quorum }
### Keep Oldest
The strategy named `keep-oldest` will down the part that does not contain the oldest
member. The oldest member is interesting because the active Cluster Singleton instance
is running on the oldest member.
There is one exception to this rule if `down-if-alone` is configured to `on`.
Then, if the oldest node has partitioned from all other nodes the oldest will down itself
and keep all other nodes running. The strategy will not down the single oldest node when
it is the only remaining node in the cluster.
Note that if the oldest node crashes the others will remove it from the cluster
when `down-if-alone` is `on`, otherwise they will down themselves if the
oldest node crashes, i.e. shut down the whole cluster together with the oldest node.
This strategy is good to use if you use Cluster Singleton and do not want to shut down the node
where the singleton instance runs. If the oldest node crashes a new singleton instance will be
started on the next oldest node. The drawback is that the strategy may keep only a few nodes
in a large cluster. For example, if one part with the oldest consists of 2 nodes and the
other part consists of 98 nodes then it will keep 2 nodes and shut down 98 nodes.
This strategy also handles the edge case that may occur when there are membership changes at the same
time as the network partition occurs. For example, the status of the oldest member is changed to `Exiting`
on one side but that information is not disseminated to the other side before the connection is broken.
It will detect this situation and make the safe decision to down all nodes on the side that sees the oldest as
`Leaving`. Note that this has the drawback that if the oldest was `Leaving` and not changed to `Exiting` then
each part will shut down itself, terminating the whole cluster.
The decision can be based on nodes with a configured `role` instead of all nodes in the cluster,
i.e. using the oldest member (singleton) within the nodes with that role.
Configuration:
```
akka.cluster.split-brain-resolver.active-strategy=keep-oldest
```
@@snip [reference.conf](/akka-cluster/src/main/resources/reference.conf) { #keep-oldest }
### Down All
The strategy named `down-all` will down all nodes.
This strategy can be a safe alternative if the network environment is highly unstable with unreachability observations
that can't be fully trusted, and including frequent occurrences of @ref:[indirectly connected nodes](#indirectly-connected-nodes).
Due to the instability there is an increased risk of different information on different sides of partitions and
therefore the other strategies may result in conflicting decisions. In such environments it can be better to shutdown
all nodes and start up a new fresh cluster.
Shutting down all nodes means that the system will be completely unavailable until nodes have been restarted and
formed a new cluster. This strategy is not recommended for large clusters (> 10 nodes) because any minor problem
will shutdown all nodes, and that is more likely to happen in larger clusters since there are more nodes that
may fail.
See also @ref[Down all when unstable](#down-all-when-unstable) and @ref:[indirectly connected nodes](#indirectly-connected-nodes).
### Lease
The strategy named `lease-majority` is using a distributed lease (lock) to decide what nodes that are allowed to
survive. Only one SBR instance can acquire the lease make the decision to remain up. The other side will
not be able to aquire the lease and will therefore down itself.
Best effort is to keep the side that has most nodes, i.e. the majority side. This is achieved by adding a delay
before trying to acquire the lease on the minority side.
There is currently one supported implementation of the lease which is backed by a
[Custom Resource Definition (CRD)](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
in Kubernetes. It is described in the [Kubernetes Lease](https://doc.akka.io/docs/akka-management/current/kubernetes-lease.html)
documentation.
This strategy is very safe since coordination is added by an external arbiter. The trade-off compared to other
strategies is that it requires additional infrastructure for implementing the lease and it reduces the availability
of a decision to that of the system backing the lease store.
Similar to other strategies it is important that decisions are not deferred for too because the nodes that couldn't
acquire the lease must decide to down themselves, see @ref[Down all when unstable](#down-all-when-unstable).
In some cases the lease will be unavailable when needed for a decision from all SBR instances, e.g. because it is
on another side of a network partition, and then all nodes will be downed.
Configuration:
```
akka {
cluster {
downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
split-brain-resolver {
active-strategy = "lease-majority"
lease-majority {
lease-implementation = "akka.lease.kubernetes"
}
}
}
}
```
@@snip [reference.conf](/akka-cluster/src/main/resources/reference.conf) { #lease-majority }
See also configuration and additional dependency in [Kubernetes Lease](https://doc.akka.io/docs/akka-management/current/kubernetes-lease.html)
## Indirectly connected nodes
In a malfunctional network there can be situations where nodes are observed as unreachable via some network
links but they are still indirectly connected via other nodes, i.e. it's not a clean network partition (or node crash).
When this situation is detected the Split Brain Resolvers will keep fully connected nodes and down all the indirectly
connected nodes.
If there is a combination of indirectly connected nodes and a clean network partition it will combine the
above decision with the ordinary decision, e.g. keep majority, after excluding suspicious failure detection
observations.
## Down all when unstable
When reachability observations by the failure detector are changed the SBR decisions
are deferred until there are no changes within the `stable-after` duration.
If this continues for too long it might be an indication of an unstable system/network
and it could result in delayed or conflicting decisions on separate sides of a network
partition.
As a precaution for that scenario all nodes are downed if no decision is made within
`stable-after + down-all-when-unstable` from the first unreachability event.
The measurement is reset if all unreachable have been healed, downed or removed, or
if there are no changes within `stable-after * 2`.
This is enabled by default for all strategies and by default the duration is derived to
be 3/4 of `stable-after`.
The below property can be defined as a duration of for how long the changes are acceptable to
continue after the `stable-after` or it can be set to `off` to disable this feature.
```
akka.cluster.split-brain-resolver {
down-all-when-unstable = 15s
stable-after = 20s
}
```
@@@ warning
It is recommended to keep `down-all-when-unstable` enabled and not set it to a longer duration than `stable-after`
(`down-removal-margin`) because that can result in delayed decisions on the side that should have been downed, e.g.
in the case of a clean network partition followed by continued instability on the side that should be downed.
That could result in that members are removed from one side but are still running on the other side.
@@@
## Multiple data centers
Akka Cluster has @ref:[support for multiple data centers](cluster-dc.md), where the cluster
membership is managed by each data center separately and independently of network partitions across different
data centers. The Split Brain Resolver is embracing that strategy and will not count nodes or down nodes in
another data center.
When there is a network partition across data centers the typical solution is to wait the partition out until it heals, i.e.
do nothing. Other decisions should be performed by an external monitoring tool or human operator.
## Cluster Singleton and Cluster Sharding
The purpose of Cluster Singleton and Cluster Sharding is to run at most one instance
of a given actor at any point in time. When such an instance is shut down a new instance
is supposed to be started elsewhere in the cluster. It is important that the new instance is
not started before the old instance has been stopped. This is especially important when the
singleton or the sharded instance is persistent, since there must only be one active
writer of the journaled events of a persistent actor instance.
Since the strategies on different sides of a network partition cannot communicate with each other
and they may take the decision at slightly different points in time there must be a time based
margin that makes sure that the new instance is not started before the old has been stopped.
You would like to configure this to a short duration to have quick failover, but that will increase the
risk of having multiple singleton/sharded instances running at the same time and it may take a different
amount of time to act on the decision (dissemination of the down/removal). The duration is by default
the same as the `stable-after` property (see @ref:[Stable after](#stable-after) above). It is recommended to
leave this value as is, but it can also be separately overriden with the `akka.cluster.down-removal-margin` property.
Another concern for setting this `stable-after`/`akka.cluster.down-removal-margin` is dealing with JVM pauses e.g.
garbage collection. When a node is unresponsive it is not known if it is due to a pause, overload, a crash or a
network partition. If it is pause that lasts longer than `stable-after` * 2 it gives time for SBR to down the node
and for singletons and shards to be started on other nodes. When the node un-pauses there will be a short time before
it sees its self as down where singletons and sharded actors are still running. It is therefore important to understand
the max pause time your application is likely to incur and make sure it is smaller than `stable-margin`.
If you choose to set a separate value for `down-removal-margin`, the recommended minimum duration for different cluster sizes are:
|cluster size | down-removal-margin|
|-------------|--------------------|
|5 | 7 s |
|10 | 10 s|
|20 | 13 s|
|50 | 17 s|
|100 | 20 s|
|1000 | 30 s|
### Expected Failover Time
As you have seen, there are several configured timeouts that add to the total failover latency.
With default configuration those are:
* failure detection 5 seconds
* stable-after 20 seconds
* down-removal-margin (by default the same as stable-after) 20 seconds
In total, you can expect the failover time of a singleton or sharded instance to be around 45 seconds
with default configuration. The default configuration is sized for a cluster of 100 nodes. If you have
around 10 nodes you can reduce the `stable-after` to around 10 seconds,
resulting in an expected failover time of around 25 seconds.

View file

@ -108,8 +108,7 @@ performs in such a case must be designed in a way that all concurrent leaders wo
might be impossible in general and only feasible under additional constraints). The most important case of that kind is a split
brain scenario where nodes need to be downed, either manually or automatically, to bring the cluster back to convergence.
See the [Lightbend Split Brain Resolver](https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html)
for an implementation of that.
The @ref:[Split Brain Resolver](../split-brain-resolver.md) is the built-in implementation of that.
Another transition that is possible without convergence is marking members as `WeaklyUp` as described in the next section.
@ -140,7 +139,7 @@ startup if a node to join have been specified in the configuration
* **leave** - tell a node to leave the cluster gracefully, normally triggered by ActorSystem or JVM shutdown through @ref[coordinated shutdown](../coordinated-shutdown.md)
* **down** - mark a node as down. This action is required to remove crashed nodes (that did not 'leave') from the cluster. It can be triggered manually, through [Cluster HTTP Management](https://doc.akka.io/docs/akka-management/current/cluster-http-management.html#put-cluster-members-address-responses), or automatically by a @ref[downing provider](cluster.md#downing) like [Split Brain Resolver](https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html)
* **down** - mark a node as down. This action is required to remove crashed nodes (that did not 'leave') from the cluster. It can be triggered manually, through [Cluster HTTP Management](https://doc.akka.io/docs/akka-management/current/cluster-http-management.html#put-cluster-members-address-responses), or automatically by a @ref[downing provider](cluster.md#downing) like @ref:[Split Brain Resolver](../split-brain-resolver.md)
#### Leader Actions

View file

@ -275,23 +275,22 @@ new joining members to 'Up'. The node must first become `reachable` again, or th
status of the unreachable member must be changed to `Down`. Changing status to `Down`
can be performed automatically or manually.
By default, downing must be performed manually using @ref:[HTTP](../additional/operations.md#http) or @ref:[JMX](../additional/operations.md#jmx).
We recommend that you enable the @ref:[Split Brain Resolver](../split-brain-resolver.md) that is part of the
Akka Cluster module. You enable it with configuration:
```
akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
```
You should also consider the different available @ref:[downing strategies](../split-brain-resolver.md#strategies).
If a downing provider is not configured downing must be performed manually using
@ref:[HTTP](../additional/operations.md#http) or @ref:[JMX](../additional/operations.md#jmx).
Note that @ref:[Cluster Singleton](cluster-singleton.md) or @ref:[Cluster Sharding entities](cluster-sharding.md) that
are running on a crashed (unreachable) node will not be started on another node until the previous node has
been removed from the Cluster. Removal of crashed (unreachable) nodes is performed after a downing decision.
A production solution for downing is provided by
[Split Brain Resolver](https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html),
which is part of the [Akka Platform](https://www.lightbend.com/akka-platform).
If you dont have a Lightbend Subscription, you should still carefully read the
[documentation](https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html)
of the Split Brain Resolver and make sure that the solution you are using handles the concerns and scenarios
described there.
A custom downing strategy can be implemented with a @apidoc[akka.cluster.DowningProvider] and enabled with
configuration `akka.cluster.downing-provider-class`.
Downing can also be performed programmatically with @scala[`Cluster(system).manager ! Down(address)`]@java[`Cluster.get(system).manager().tell(Down(address))`],
but that is mostly useful from tests and when implementing a `DowningProvider`.

View file

@ -18,7 +18,6 @@ With a [Lightbend Platform Subscription](https://www.lightbend.com/lightbend-sub
[Akka Resilience Enhancements](https://doc.akka.io/docs/akka-enhancements/current/akka-resilience-enhancements.html):
* [Split Brain Resolver](https://doc.akka.io/docs/akka-enhancements/current/split-brain-resolver.html) &#8212; Detects and recovers from network partitions, eliminating data inconsistencies and possible downtime.
* [Configuration Checker](https://doc.akka.io/docs/akka-enhancements/current/config-checker.html) &#8212; Checks for potential configuration issues and logs suggestions.
* [Diagnostics Recorder](https://doc.akka.io/docs/akka-enhancements/current/diagnostics-recorder.html) &#8212; Captures configuration and system information in a format that makes it easy to troubleshoot issues during development and production.
* [Thread Starvation Detector](https://doc.akka.io/docs/akka-enhancements/current/starvation-detector.html) &#8212; Monitors an Akka system dispatcher and logs warnings if it becomes unresponsive.

View file

@ -25,6 +25,7 @@ project.description: Akka Cluster concepts, node membership service, CRDT Distri
* [multi-node-testing](../multi-node-testing.md)
* [remoting-artery](../remoting-artery.md)
* [remoting](../remoting.md)
* [split-brain-resolver](../split-brain-resolver.md)
* [coordination](../coordination.md)
* [choosing-cluster](choosing-cluster.md)

View file

@ -0,0 +1,105 @@
/*
* Copyright (C) 2019-2020 Lightbend Inc. <https://www.lightbend.com>
*/
package jdocs.coordination;
import akka.actor.ActorSystem;
import akka.coordination.lease.LeaseSettings;
import akka.coordination.lease.javadsl.Lease;
import akka.coordination.lease.javadsl.LeaseProvider;
import akka.testkit.javadsl.TestKit;
import docs.coordination.LeaseDocSpec;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import org.scalatestplus.junit.JUnitSuite;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;
@SuppressWarnings("unused")
public class LeaseDocTest extends JUnitSuite {
// #lease-example
static class SampleLease extends Lease {
private LeaseSettings settings;
public SampleLease(LeaseSettings settings) {
this.settings = settings;
}
@Override
public LeaseSettings getSettings() {
return settings;
}
@Override
public CompletionStage<Boolean> acquire() {
return CompletableFuture.completedFuture(true);
}
@Override
public CompletionStage<Boolean> acquire(Consumer<Optional<Throwable>> leaseLostCallback) {
return CompletableFuture.completedFuture(true);
}
@Override
public CompletionStage<Boolean> release() {
return CompletableFuture.completedFuture(true);
}
@Override
public boolean checkLease() {
return true;
}
}
// #lease-example
private static ActorSystem system;
@BeforeClass
public static void setup() {
system = ActorSystem.create("LeaseDocTest", LeaseDocSpec.config());
}
@AfterClass
public static void teardown() {
TestKit.shutdownActorSystem(system);
system = null;
}
private void doSomethingImportant(Optional<Throwable> leaseLostReason) {}
@Test
public void javaLeaseBeLoadableFromJava() {
// #lease-usage
Lease lease =
LeaseProvider.get(system).getLease("<name of the lease>", "jdocs-lease", "<owner name>");
CompletionStage<Boolean> acquired = lease.acquire();
boolean stillAcquired = lease.checkLease();
CompletionStage<Boolean> released = lease.release();
// #lease-usage
// #lost-callback
lease.acquire(this::doSomethingImportant);
// #lost-callback
// #cluster-owner
// String owner = Cluster.get(system).selfAddress().hostPort();
// #cluster-owner
}
@Test
public void scalaLeaseBeLoadableFromJava() {
Lease lease =
LeaseProvider.get(system).getLease("<name of the lease>", "docs-lease", "<owner name>");
CompletionStage<Boolean> acquired = lease.acquire();
boolean stillAcquired = lease.checkLease();
CompletionStage<Boolean> released = lease.release();
lease.acquire(this::doSomethingImportant);
}
}

View file

@ -0,0 +1,102 @@
/*
* Copyright (C) 2019-2020 Lightbend Inc. <https://www.lightbend.com>
*/
package docs.coordination
import scala.concurrent.Future
import com.typesafe.config.ConfigFactory
import akka.cluster.Cluster
import akka.coordination.lease.LeaseSettings
import akka.coordination.lease.scaladsl.Lease
import akka.coordination.lease.scaladsl.LeaseProvider
import akka.testkit.AkkaSpec
//#lease-example
class SampleLease(settings: LeaseSettings) extends Lease(settings) {
override def acquire(): Future[Boolean] = {
Future.successful(true)
}
override def acquire(leaseLostCallback: Option[Throwable] => Unit): Future[Boolean] = {
Future.successful(true)
}
override def release(): Future[Boolean] = {
Future.successful(true)
}
override def checkLease(): Boolean = {
true
}
}
//#lease-example
object LeaseDocSpec {
def config() =
ConfigFactory.parseString("""
jdocs-lease.lease-class = "jdocs.coordination.LeaseDocTest$SampleLease"
#lease-config
akka.actor.provider = cluster
docs-lease {
lease-class = "docs.coordination.SampleLease"
heartbeat-timeout = 100s
heartbeat-interval = 1s
lease-operation-timeout = 1s
# Any lease specific configuration
}
#lease-config
""".stripMargin)
def blackhole(stuff: Any*): Unit = {
stuff.toString
()
}
def doSomethingImportant(leaseLostReason: Option[Throwable]): Unit = {
leaseLostReason.map(_.toString)
()
}
}
class LeaseDocSpec extends AkkaSpec(LeaseDocSpec.config) {
import LeaseDocSpec._
"A docs lease" should {
"scala lease be loadable from scala" in {
//#lease-usage
val lease = LeaseProvider(system).getLease("<name of the lease>", "docs-lease", "owner")
val acquired: Future[Boolean] = lease.acquire()
val stillAcquired: Boolean = lease.checkLease()
val released: Future[Boolean] = lease.release()
//#lease-usage
//#lost-callback
lease.acquire(leaseLostReason => doSomethingImportant(leaseLostReason))
//#lost-callback
//#cluster-owner
val owner = Cluster(system).selfAddress.hostPort
//#cluster-owner
// remove compiler warnings
blackhole(acquired, stillAcquired, released, owner)
}
"java lease be loadable from scala" in {
val lease = LeaseProvider(system).getLease("<name of the lease>", "jdocs-lease", "owner")
val acquired: Future[Boolean] = lease.acquire()
val stillAcquired: Boolean = lease.checkLease()
val released: Future[Boolean] = lease.release()
lease.acquire(leaseLostReason => doSomethingImportant(leaseLostReason))
blackhole(acquired, stillAcquired, released)
}
}
}