* Guarantee no sneaky type puts more teams in the role list
* Leader per team and initial tests
* MiMa filters
* Second iteration (not working though)
* Verbose gossip logging etc.
* Gossip to team-nodes even if there is inter-team unreachability
* More work ...
* Marking removed nodes with tombstones in Gossip
* More test coverage for Gossip.remove
* Bug failing other multi-node tests squashed
* Multi-node test for team-split
* Review fixes - only prune tombstones on leader ticks
* Clean code is happy code.
* All I want is for MiMa to be my friend
* These constants are internal
* Making the formatting gods happy
* I used the wrong reachability for ignoring gossip :/
* Still hadn't quite gotten how reachability was supposed to work
* Review feedback applied
* Cross-team downing should still work
* Actually prune tombstones in the prune tombstones method ...
* Another round against reachability. Reachability leading with 15 - 2 so far.
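A minimal sketch of the tombstone handling mentioned in the commits above, assuming a map from removed member address to removal timestamp; names and types are illustrative, not the actual akka.cluster internals:

```scala
// Illustrative only: removed members are remembered with a removal timestamp
// so that stale gossip about them can be ignored, and old entries are pruned
// only from the leader's periodic tick so that all nodes prune consistently.
final case class GossipTombstones(entries: Map[String, Long] = Map.empty) {

  // record that a member (identified here by its address) was removed
  def add(address: String, removalTimeMillis: Long): GossipTombstones =
    copy(entries = entries + (address -> removalTimeMillis))

  // gossip coming from (or about) a tombstoned node should not be merged back in
  def isTombstoned(address: String): Boolean = entries.contains(address)

  // called from leader actions only ("only prune tombstones on leader ticks")
  def prune(now: Long, keepForMillis: Long): GossipTombstones =
    copy(entries = entries.filter { case (_, removedAt) => now - removedAt <= keepForMillis })
}
```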
(cherry picked from commit a06badaa03fa9f3c9a942b1468090f758c74a869)
* Introduce cluster 'team' setting and add to Member
Introduced cluster-team.md so we can grow the documentation with each
PR, but did not add it to the ToC yet.
* Fewer abbreviations, more reliable test
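A rough sketch of how the cluster 'team' setting above could be carried on a member, assuming it is encoded in the existing role list with a reserved prefix; the prefix, helper names and the single-team guard are assumptions for illustration, not necessarily the setting actually introduced:

```scala
// Illustrative only; the real encoding and validation may differ.
object Team {
  val Prefix = "team-" // assumed reserved role prefix

  // derive the member's team from its role set, falling back to a default team
  def teamOf(roles: Set[String], default: String = "default"): String =
    roles.collectFirst { case r if r.startsWith(Prefix) => r.stripPrefix(Prefix) }
      .getOrElse(default)

  // guard so that a member cannot end up with more than one team role
  // (cf. the first commit above about the role list)
  def requireSingleTeam(roles: Set[String]): Unit = {
    val teams = roles.filter(_.startsWith(Prefix))
    require(teams.size <= 1, s"a member may belong to at most one team, found: $teams")
  }
}
```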
* properly shut down ArteryTransport using CoordinatedShutdown, #22671
* The shutdownHook set the hasBeenShutdown flag to true, so when
  transport.shutdown was invoked later the shutdown sequence was skipped
  until it was too late and the ActorSystem had already terminated.
* Also improved the cluster shutdown tasks when the cluster node had not
joined
* CoordinatedShutdownLeave explicit events
* re-implement javadsl testkit
* fix MiMa problem
* rebase master
* move ImplicitSender/DefaultTimeout to scaladsl
* undo the change of moving scala api
* fix return type and add doc
* resolve conflicts and add more comments
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase config obj with depends-on list
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
* send ExitingConfirmed to leader before shutdown of Exiting
  so we do not have to wait for the failure detector to mark it as
  unreachable before removing it
* the unreachable signal is still kept as a safeguard in case the
  message is lost or the leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate ActorSystem when the cluster is shut down (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
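A sketch of how the task/phase model above is used, based on the CoordinatedShutdown API as it ships in Akka 2.5; the exact phase and setting names in this PR may have differed slightly:

```scala
import akka.Done
import akka.actor.{ ActorSystem, CoordinatedShutdown }
import scala.concurrent.Future

object CoordinatedShutdownExample extends App {
  val system = ActorSystem("example")

  // register a task in one of the predefined phases; phases form a DAG via
  // their depends-on lists in configuration, and each phase has a timeout
  // and an optional recover setting that lets the shutdown continue on failure
  CoordinatedShutdown(system).addTask(
    CoordinatedShutdown.PhaseBeforeServiceUnbind, "log-shutdown") { () =>
    system.log.info("shutting down")
    Future.successful(Done)
  }

  // leaving the cluster, optionally exiting the JVM in the last phase and
  // running via a JVM shutdown hook are all driven by the same run()
  CoordinatedShutdown(system).run()
}
```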
* problem when the leader order was sys2, sys1, sys3,
  then sys3 could not perform its duties and move Leaving sys1 to
  Exiting because it was observing sys1 as unreachable
* exclude Leaving with exitingConfirmed from convergence condition
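A simplified sketch of that convergence tweak, with illustrative types rather than the real Gossip/Membership classes: a member that is Leaving and has confirmed exiting does not need to have seen the latest gossip for convergence to be reached.

```scala
sealed trait Status
case object Up extends Status
case object Leaving extends Status

final case class MemberInfo(address: String, status: Status)

// `seen` is the set of addresses that have seen the current gossip version;
// members that are Leaving with a confirmed exit are excluded from the check.
def convergence(
    members: Set[MemberInfo],
    seen: Set[String],
    exitingConfirmed: Set[String]): Boolean =
  members.forall { m =>
    seen.contains(m.address) ||
      (m.status == Leaving && exitingConfirmed.contains(m.address))
  }
```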
* to be able to introduce new messages and still support rolling upgrades,
i.e. a cluster of mixed versions
* note that it's only catching NotSerializableException, which we already
use for unknown serializer ids and class manifests
* note that it does not catch it for system messages, since that could result
  in infinite resending
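Roughly, the behaviour described above amounts to something like the following sketch (not the actual remoting code):

```scala
import java.io.NotSerializableException

// If deserialization of an inbound user message fails with
// NotSerializableException (unknown serializer id, class manifest or message
// type sent by a newer node), the message is dropped and logged instead of
// failing the connection. System messages are excluded, since dropping them
// would lead to infinite resending.
def deserializeOrDrop[T](isSystemMessage: Boolean)(deserialize: () => T): Option[T] =
  try Some(deserialize())
  catch {
    case e: NotSerializableException if !isSystemMessage =>
      println(s"Dropping message that could not be deserialized: ${e.getMessage}")
      None
  }
```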
* in the failed test it was noticed that a Down member removed
itself in leaderActionsOnConvergence which resulted in
later "Failed to serialize Gossip, Unknown address"
* never use member with status Down as leader
* a node will shut itself down anyway when it is Down,
  but leader actions could happen before that
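A sketch of that rule, with illustrative types rather than the actual akka.cluster member ordering:

```scala
final case class NodeInfo(address: String, status: String)

// The leader is the first member in address order, but a member with status
// Down must never be picked, even though the downed node will eventually shut
// itself down, because leader actions may run before that happens.
def leader(members: Seq[NodeInfo]): Option[NodeInfo] =
  members.filterNot(_.status == "Down").sortBy(_.address).headOption
```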
* Verify that it actually fails with classic remoting
if vector clocks are not pruned
* Make it pass with Artery, but it is not verifying
the message sizes yet. We should implement that
with a custom RemoteInstrument, but that can be done
in a separate PR.
* Still pending with Artery because it still fails on Jenkins
* barrier after sys shutdown
(cherry picked from commit d5edcbea35ca5b43ca4cfb3018602dd555402f42)
* speedup ActorCreationPerfSpec
* reduce iterations in ConsistencySpec
* tag SupervisorHierarchySpec as LongRunningTest
* various small speedups and tagging in actor-tests
* speedup expectNoMsg in stream-tests
* tag FramingSpec, and reduce iterations
* speedup QueueSourceSpec
* tag some stream-tests
* reduce iterations in persistence.PerformanceSpec
* reduce iterations in some cluster perf tests
* tag RemoteWatcherSpec
* tag InterpreterStressSpec
* remove LongRunning from ClusterConsistentHashingRouterSpec
* sys property to disable multi-jvm tests in test
* actually disable multi-node tests in validatePullRequest
* doc sbt flags in CONTRIBUTING
As documented in the code:
// Leader is moving itself from Leaving to Exiting. Let others know (best effort)
// before shutdown. Otherwise they will not see the Exiting state change
// and there will not be convergence until they have detected this node as
// unreachable and the required downing has finished. They will still need to detect
// unreachable, but Exiting unreachable will be removed without downing, i.e.
// normally the leaving of a leader will be graceful without the need
// for downing. However, if those final gossip messages never arrive it is
// alright to require the downing, because that is probably caused by a
// network failure anyway.
That is fine, but this change improves the selection of the nodes to
send the final gossip messages to.
I could reproduce the failure in ClusterSingletonManagerLeaveSpec and with
additional logging I verified that in the failure case it picked the "first"
node 3 times (it's random) and that node had already been shut down (left earlier
in the test) but was not removed yet.
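The improved selection can be pictured roughly like this (illustrative only; the real implementation works on the Gossip membership and reachability data):

```scala
import scala.util.Random

final case class Node(address: String, status: String, reachable: Boolean)

// When the leader moves itself from Leaving to Exiting it sends a few final
// gossip messages; prefer nodes that are still reachable and not themselves
// Leaving/Exiting/Down, so a random pick does not land on a node that has
// already been shut down but is not yet removed.
def finalGossipTargets(members: Seq[Node], selfAddress: String, count: Int = 3): Seq[Node] = {
  val preferred = members.filter { n =>
    n.address != selfAddress && n.reachable &&
      !Set("Leaving", "Exiting", "Down").contains(n.status)
  }
  Random.shuffle(preferred).take(count)
}
```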