* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase configuration object with a depends-on list (see the configuration sketch below)
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
* send ExitingConfirmed to the leader before shutdown of the Exiting node,
so that the leader does not have to wait for the failure detector to mark it
as unreachable before removing it
* the unreachable signal is still kept as a safeguard in case the
message is lost or the leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate the ActorSystem when the cluster is shut down (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
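A minimal sketch of how the pieces above fit together, assuming the configuration keys and the CoordinatedShutdown.addTask/run API described in these notes; the phase name "my-phase" and the task name "my-task" are made up for illustration:

```scala
import akka.Done
import akka.actor.{ ActorSystem, CoordinatedShutdown }
import com.typesafe.config.ConfigFactory
import scala.concurrent.Future

object CoordinatedShutdownSketch extends App {
  val config = ConfigFactory.parseString("""
    akka.coordinated-shutdown {
      phases {
        # custom phase, ordered after the predefined before-service-unbind phase
        my-phase {
          depends-on = [before-service-unbind]
          timeout = 10 s
          recover = on  # a failed or timed out task does not abort later phases
        }
      }
      # optionally exit the JVM in the last phase and run via JVM shutdown hook
      exit-jvm = off
      run-by-jvm-shutdown-hook = on
    }
  """).withFallback(ConfigFactory.load())

  val system = ActorSystem("Sketch", config)

  // tasks are registered per phase and the phases run in the order given by the DAG
  CoordinatedShutdown(system).addTask("my-phase", "my-task") { () =>
    // asynchronous cleanup, completing with Done when finished
    Future.successful(Done)
  }

  // leaving the cluster / stopping the system goes through CoordinatedShutdown.run
  CoordinatedShutdown(system).run()
}
```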
* problem when the leader order was sys2, sys1, sys3,
then sys3 could not perform its duties and move the Leaving sys1 to
Exiting because it was observing sys1 as unreachable
* exclude Leaving members with exitingConfirmed from the convergence condition
(sketched below)
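A hedged sketch of that convergence rule, not the actual akka.cluster implementation; memberBlocksConvergence is a made-up name and the exact signature is an assumption for illustration:

```scala
import akka.cluster.{ Member, MemberStatus, UniqueAddress }

object ConvergenceSketch {
  // a Leaving member whose exit has already been confirmed (ExitingConfirmed
  // received) must not block convergence, otherwise the leader cannot move it
  // to Exiting once the node has shut down and is seen as unreachable
  def memberBlocksConvergence(member: Member, exitingConfirmed: Set[UniqueAddress]): Boolean =
    member.status match {
      case MemberStatus.Leaving if exitingConfirmed(member.uniqueAddress) => false
      case _ => true // all other members are subject to the normal convergence checks
    }
}
```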
* to be able to introduce new messages and still support rolling upgrades,
i.e. a cluster of mixed versions
* note that it is only catching NotSerializableException, which we already
use for unknown serializer ids and class manifests (see the sketch below)
* note that it is not catching it for system messages, since that could result
in infinite resending
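A hedged illustration of the existing NotSerializableException pattern referred to above, on the deserialization side; the serializer class, identifier, and manifests are made up for the example:

```scala
import java.io.NotSerializableException
import akka.serialization.SerializerWithStringManifest

class MyMessageSerializer extends SerializerWithStringManifest {
  private val GreetingManifest = "A"

  override def identifier: Int = 77777

  override def manifest(o: AnyRef): String = o match {
    case _: String => GreetingManifest
    case _         => throw new IllegalArgumentException(s"Unsupported type ${o.getClass}")
  }

  override def toBinary(o: AnyRef): Array[Byte] = o match {
    case s: String => s.getBytes("UTF-8")
    case _         => throw new IllegalArgumentException(s"Unsupported type ${o.getClass}")
  }

  override def fromBinary(bytes: Array[Byte], manifest: String): AnyRef = manifest match {
    case GreetingManifest => new String(bytes, "UTF-8")
    // a newer node may send manifests this (older) node does not know about yet;
    // throwing NotSerializableException lets the message be dropped instead of
    // failing the connection, which is what makes rolling upgrades possible
    case other => throw new NotSerializableException(s"Unknown manifest [$other]")
  }
}
```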
* in the failed test it was noticed that a Down member removed
itself in leaderActionsOnConvergence, which resulted in
a later "Failed to serialize Gossip, Unknown address"
* never use a member with status Down as leader
* a node will shut itself down anyway when it is Down,
but leader actions could happen before that
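A hedged sketch of that rule, not the actual akka.cluster leader selection; leaderCandidates is a made-up name for illustration:

```scala
import scala.collection.immutable.SortedSet
import akka.cluster.{ Member, MemberStatus }

object LeaderSelectionSketch {
  // a Down member will shut itself down eventually, but leader actions could
  // run before that happens, so it must never be considered as leader
  def leaderCandidates(members: SortedSet[Member]): SortedSet[Member] =
    members.filterNot(_.status == MemberStatus.Down)
}
```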
* Verify that it actually fails with classic remoting
if vector clocks are not pruned
* Make it pass with Artery, but it is not verifying
the message sizes yet. We should implement that
with a custom RemoteInstrument, but that can be done
in a separate PR.
* Still pending with Artery because it still fails on Jenkins
* barrier after sys shutdown
(cherry picked from commit d5edcbea35ca5b43ca4cfb3018602dd555402f42)
* speed up ActorCreationPerfSpec
* reduce iterations in ConsistencySpec
* tag SupervisorHierarchySpec as LongRunningTest
* various small speedups and tagging in actor-tests
* speed up expectNoMsg in stream-tests
* tag FramingSpec, and reduce iterations
* speed up QueueSourceSpec
* tag some stream-tests
* reduce iterations in persistence.PerformanceSpec
* reduce iterations in some cluster perf tests
* tag RemoteWatcherSpec
* tag InterpreterStressSpec
* remove LongRunning from ClusterConsistentHashingRouterSpec
* system property to disable multi-jvm tests in test
* actually disable multi-node tests in validatePullRequest
* document the sbt flags in CONTRIBUTING
As documented in the code:
// Leader is moving itself from Leaving to Exiting. Let others know (best effort)
// before shutdown. Otherwise they will not see the Exiting state change
// and there will not be convergence until they have detected this node as
// unreachable and the required downing has finished. They will still need to detect
// unreachable, but Exiting unreachable will be removed without downing, i.e.
// normally the leaving of a leader will be graceful without the need
// for downing. However, if those final gossip messages never arrive it is
// alright to require the downing, because that is probably caused by a
// network failure anyway.
That is fine, but this change improves the selection of the nodes to
send the final gossip messages to (sketched below).
I could reproduce the failure in ClusterSingletonManagerLeaveSpec, and with
additional logging I verified that in the failure case it picked the "first"
node 3 times (the pick is random) and that node had already been shut down
(it left earlier in the test) but was not removed yet.
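A hedged sketch of the kind of selection this is about, not the actual change; finalGossipTargets and its signature are assumptions for illustration. The point is to avoid randomly picking a node that has already left and shut down but is not yet removed:

```scala
import scala.collection.immutable.SortedSet
import scala.util.Random
import akka.cluster.{ Member, MemberStatus }

object ExitingGossipSketch {
  def finalGossipTargets(members: SortedSet[Member], unreachable: Set[Member], n: Int): Vector[Member] = {
    // prefer members that are still reachable and not themselves on the way out
    val candidates = members.toVector.filterNot { m =>
      unreachable(m) ||
      m.status == MemberStatus.Down ||
      m.status == MemberStatus.Leaving ||
      m.status == MemberStatus.Exiting
    }
    Random.shuffle(candidates).take(n)
  }
}
```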
* image-liveness-timeout must be less than the handshake-timeout,
otherwise the publication for the handshake will give up too early
when a previous image is still considered alive (see the configuration sketch below)
* they can't be stopped immediately because we want to send
some final message and we reply to inbound messages with `Quarantined`
* and improve logging
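A hedged configuration sketch of the relationship between the two timeouts; the paths under akka.remote.artery.advanced are my assumption here, check akka-remote's reference.conf for the authoritative keys and defaults:

```scala
import com.typesafe.config.ConfigFactory

object ArteryTimeoutsSketch {
  val config = ConfigFactory.parseString("""
    akka.remote.artery.advanced {
      # how long the handshake keeps retrying before giving up
      handshake-timeout = 20 s
      # must be shorter than handshake-timeout, otherwise the handshake
      # publication gives up while a previous Aeron image is still alive
      image-liveness-timeout = 10 s
    }
  """)
}
```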
* Setting to configure where the flight recorder puts its file
* Run ArteryMultiNodeSpecs with flight recorder enabled
* More cleanup in exit hook, wait for task runner to stop
* Enable flight recorder for the cluster multi node tests
* Enable flight recorder for multi node remoting tests
* Toggle always-dump flight recorder output when akka.remote.artery.always-dump-flight-recorder is set
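A hedged sketch of enabling the flight recorder and pointing it at a file; the exact key names and the destination path are assumptions based on the notes above:

```scala
import com.typesafe.config.ConfigFactory

object FlightRecorderSketch {
  val config = ConfigFactory.parseString("""
    akka.remote.artery.advanced.flight-recorder {
      enabled = on
      # where the flight recorder puts its file (a temporary file if left empty)
      destination = "target/artery-flight-recorder.afr"
    }
  """)
}
```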
* need to use a shared media driver to get the CPU usage
down to a reasonable level (see the configuration sketch below)
* also changed to SleepingIdleStrategy(1 ms) when cpu-level=1;
not needed for the test to pass, but it is good to make level 1
more extreme
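A hedged sketch of pointing the test systems at a shared, externally started Aeron media driver; the key names and the aeron-dir path are assumptions based on the notes above:

```scala
import com.typesafe.config.ConfigFactory

object SharedMediaDriverSketch {
  val config = ConfigFactory.parseString("""
    akka.remote.artery.advanced {
      # don't start an embedded media driver per ActorSystem,
      # use one shared driver started outside the test instead
      embedded-media-driver = off
      aeron-dir = "/dev/shm/aeron-shared"
      # 1 selects the least CPU hungry idle strategy
      idle-cpu-level = 1
    }
  """)
}
```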
* Don't quarantine the other system when receiving the Quarantined message,
since that would result in cluster member removal and could end up
forming two separate clusters (cluster split).
* Instead, the downing strategy should act on ThisActorSystemQuarantinedEvent, e.g.
use it as a STONITH signal.
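A hedged sketch of a listener that treats ThisActorSystemQuarantinedEvent as a STONITH signal by terminating its own ActorSystem; the event class location and the listener actor are assumptions for illustration, not the actual downing provider:

```scala
import akka.actor.{ Actor, ActorSystem, Props }
import akka.remote.artery.ThisActorSystemQuarantinedEvent

class QuarantinedListener extends Actor {
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[ThisActorSystemQuarantinedEvent])

  def receive: Receive = {
    case _: ThisActorSystemQuarantinedEvent =>
      // this system has been quarantined by another system: shut this node down
      // instead of quarantining back, which could split the cluster in two
      context.system.terminate()
  }
}

object QuarantinedListenerSketch extends App {
  val system = ActorSystem("Sketch")
  system.actorOf(Props[QuarantinedListener](), "quarantined-listener")
}
```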