The test would fail by picking up the reachable event from the previous unsplit,
since it is a new probe.
Also change barrierCounter to split/unsplit so it is easier to see
where the failure is when a barrier fails.
* When leaving/downing the last node in a DC, it would not
be removed in other DCs, since that removal was only done by the
leader in the owning DC (and that leader is gone).
* It should be OK for leaders in other DCs to eagerly remove
such nodes as well (see the sketch after this list).
* Note that gossip has already been sent out, so for the last node
it will be spread to the other DCs, unless there is a network
partition. We can't do anything about that case; the node will be
replaced if it joins again.
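A minimal, self-contained sketch of that pruning rule; MemberInfo and the helper are illustrative stand-ins, not Akka's internal Member type or leader-action code:

    object EagerRemovalSketch {
      final case class MemberInfo(address: String, dataCenter: String, status: String)

      private val terminal = Set("Exiting", "Down")

      // A leader in selfDc may also remove Exiting/Down members of another DC when
      // that DC has no live members left to act as its own leader.
      def removableByOtherDcLeader(members: Set[MemberInfo], selfDc: String): Set[MemberInfo] =
        members.filter { m =>
          m.dataCenter != selfDc &&
            terminal(m.status) &&
            !members.exists(o => o.dataCenter == m.dataCenter && !terminal(o.status))
        }
    }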
There exists a race where a cluster node that is being downed sees
itself as the oldest node (since the other nodes have been removed from
its view) and takes over the singleton manager, causing the real oldest
node to go into the End state, which means that cluster singletons never
work again.
This fix simply prevents Member events from being given to the
ClusterSingletonManager FSM during a shutdown, relying on SelfExiting
instead (a sketch of the guard is shown below).
This also hardens the test by not downing the node that the current
sharding coordinator is running on, as well as fixing a bug in the
probes.
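A sketch of the guard, using a simplified listener rather than the real ClusterSingletonManager FSM; the self-removal case below is a stand-in for the SelfExiting signal used by the actual fix:

    import akka.actor.{ Actor, ActorLogging }
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent.{ CurrentClusterState, MemberEvent, MemberRemoved }

    class OldestTrackerSketch extends Actor with ActorLogging {
      private val cluster = Cluster(context.system)
      private var shuttingDown = false

      override def preStart(): Unit = cluster.subscribe(self, classOf[MemberEvent])
      override def postStop(): Unit = cluster.unsubscribe(self)

      def receive: Receive = {
        case _: CurrentClusterState => // initial snapshot, ignored in this sketch
        case MemberRemoved(m, _) if m.uniqueAddress == cluster.selfUniqueAddress =>
          shuttingDown = true // stand-in for SelfExiting
        case _: MemberEvent if shuttingDown =>
          // This node is shutting down; its shrinking membership view must not be
          // used to decide that it has become the oldest node.
          log.debug("Ignoring member event during shutdown")
        case _: MemberEvent =>
          // normal oldest-node bookkeeping would happen here in the real FSM
      }
    }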
The last time this failed, there was no gossip to or from a node that
didn't see fifth coming back.
Also note that this test doesn't quite test what it claims, since the
split brain is repaired before the second actor system is started, but
without extensions to the multi-JVM test kit this can't be improved.
Refs #23306
The test has been failing infrequently because, by the time we get to
the final barrier (restarted-fifth-removed), the whole test's within of
40s has already been reached, so the last barrier times out right away.
Try removing the Thread.sleep and relying on a larger timeout for the
whole test, as well as on the default barrier timeout of 30s.
* looks like the ActorSystem is shut down when leaving
* Included in MultiNodeSpec, i.e. all multi-node tests:
akka.coordinated-shutdown.terminate-actor-system = off
akka.coordinated-shutdown.run-by-jvm-shutdown-hook = off
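For illustration, a sketch of setting the same flags explicitly in a multi-node test config; the TestConfig object and its roles are made up, and MultiNodeSpec now includes these settings by default:

    import akka.remote.testkit.MultiNodeConfig
    import com.typesafe.config.ConfigFactory

    object TestConfig extends MultiNodeConfig {
      val first = role("first")
      val second = role("second")

      commonConfig(ConfigFactory.parseString("""
        akka.coordinated-shutdown.terminate-actor-system = off
        akka.coordinated-shutdown.run-by-jvm-shutdown-hook = off
        """))
    }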
* Adjust cross-DC gossip probability for a small number of nodes in a DC
When a DC is being bootstrapped, the initial node has no local peers and
cannot gossip if it selects a local gossip round. Start at a
probability of 1.0 for a single-node DC and move down by 0.25 per node
until a 5-node DC is reached, then use the cross-data-center-gossip-probability
setting (a sketch of this calculation is shown below).
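A minimal sketch of that calculation, with illustrative names (crossDcGossipProbability, configuredCrossDcProbability) rather than the actual internal API:

    object GossipProbabilitySketch {
      // 1 node -> 1.0, 2 -> 0.75, 3 -> 0.5, 4 -> 0.25, then the configured
      // cross-data-center-gossip-probability; never goes below the configured value.
      def crossDcGossipProbability(localDcNodeCount: Int, configuredCrossDcProbability: Double): Double =
        if (localDcNodeCount >= 5) configuredCrossDcProbability
        else math.max(1.0 - (localDcNodeCount - 1) * 0.25, configuredCrossDcProbability)
    }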
* Fix cross-DC gossip selection of oldest members
This used to select the members based on the sort order of members in
Gossip (by address) rather than by upNumber (see the sketch below).
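A self-contained sketch of selecting the oldest members per data center by upNumber; DcMember is an illustrative stand-in for the real Member type:

    object OldestByUpNumberSketch {
      final case class DcMember(address: String, dataCenter: String, upNumber: Int)

      // A lower upNumber means the member became Up earlier, i.e. it is older.
      def oldestPerDataCenter(members: Seq[DcMember], n: Int): Map[String, Seq[DcMember]] =
        members.groupBy(_.dataCenter).map {
          case (dc, dcMembers) => dc -> dcMembers.sortBy(_.upNumber).take(n)
        }
    }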
* MemberRemoved must be published before MemberUp, e.g. when a node is
restarted in another DC (see the ordering sketch below)
* remove from the failureDetector when receiving gossip with a new member,
not only a new joining member
* increase timeout in MultiDcSingletonManagerSpec
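A sketch of the required publishing order for a member restarted under the same address (new incarnation, possibly in another DC); the types and helper are illustrative, not Akka's internal event diffing:

    object RestartEventOrderSketch {
      sealed trait MemberEvent
      final case class MemberUp(address: String, uid: Long) extends MemberEvent
      final case class MemberRemoved(address: String, uid: Long) extends MemberEvent

      // The old incarnation must be removed before the new incarnation is published
      // as Up, otherwise subscribers could observe the restarted node as removed
      // right after seeing it come up.
      def eventsForRestart(address: String, oldUid: Long, newUid: Long): Seq[MemberEvent] =
        Seq(MemberRemoved(address, oldUid), MemberUp(address, newUid))
    }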
* Cluster management (join, leave, etc)
* Cluster membership subscriptions (MemberUp, MemberRemoved, etc)
* New SelfUp and SelfRemoved events (a subscription sketch is shown below)
* change signature of awaitAssert to return the value (not binary compatible)
* Cluster singleton API
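A sketch of subscribing to the new SelfUp event with the typed Cluster API, assuming the akka.cluster.typed package layout (the package name and exact signatures may have differed at the time of this change):

    import akka.actor.typed.Behavior
    import akka.actor.typed.scaladsl.Behaviors
    import akka.cluster.typed.{ Cluster, SelfUp, Subscribe }

    object SelfUpListener {
      def apply(): Behavior[SelfUp] =
        Behaviors.setup { context =>
          Cluster(context.system).subscriptions ! Subscribe(context.self, classOf[SelfUp])
          Behaviors.receiveMessage { _ =>
            // the self member has reached Up; cluster-dependent startup can proceed
            context.log.info("Self member is Up")
            Behaviors.same
          }
        }
    }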
* the crossDcFailureDetector was not connected to the reachability table
* additional test that listens for {Reachable/Unreachable}DataCenter events in the split spec (a subscription sketch follows below)
* missing Java API for getUnreachableDataCenters in CurrentClusterState
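A sketch of such a listener using the {Reachable/Unreachable}DataCenter events named above; the actor itself is illustrative. CurrentClusterState exposes the unreachable set as unreachableDataCenters (Scala) and getUnreachableDataCenters (Java):

    import akka.actor.{ Actor, ActorLogging, Props }
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent.{ ReachableDataCenter, UnreachableDataCenter }

    class DcReachabilityListener extends Actor with ActorLogging {
      private val cluster = Cluster(context.system)

      override def preStart(): Unit =
        cluster.subscribe(self, classOf[UnreachableDataCenter], classOf[ReachableDataCenter])
      override def postStop(): Unit =
        cluster.unsubscribe(self)

      def receive: Receive = {
        case UnreachableDataCenter(dc) => log.warning("Data center [{}] is unreachable", dc)
        case ReachableDataCenter(dc)   => log.info("Data center [{}] is reachable again", dc)
        case _                         => // initial CurrentClusterState snapshot, ignored here
      }
    }

    object DcReachabilityListener {
      def props: Props = Props(new DcReachabilityListener)
    }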