As documented in the code:
// Leader is moving itself from Leaving to Exiting. Let others know (best effort)
// before shutdown. Otherwise they will not see the Exiting state change
// and there will not be convergence until they have detected this node as
// unreachable and the required downing has finished. They will still need to detect
// unreachable, but Exiting unreachable will be removed without downing, i.e.
// normally the leaving of a leader will be graceful without the need
// for downing. However, if those final gossip messages never arrive it is
// alright to require the downing, because that is probably caused by a
// network failure anyway.
That behavior is fine, but this change improves how the nodes that receive
the final gossip messages are selected.
I could reproduce the failure in ClusterSingletonManagerLeaveSpec, and with
additional logging I verified that in the failure case it picked the "first"
node 3 times (the selection is random) and that node had already been shut down
(it left earlier in the test) but had not been removed yet.
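The failure mode above (repeatedly picking a node that has already shut down but is not yet removed) can be illustrated with a small stand-alone sketch. This is not the Akka implementation: `Member`, `Status`, and `pickTargets` are hypothetical names, and the rule of skipping members that are already Exiting is an assumption chosen only to show the idea of filtering candidates before choosing distinct targets at random.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class FinalGossipTargets {
    // Hypothetical, simplified member states; not Akka's MemberStatus.
    enum Status { UP, LEAVING, EXITING }

    // Hypothetical stand-in for a cluster member.
    record Member(String address, Status status) {}

    /**
     * Picks up to maxTargets distinct members, skipping the sender itself and
     * members that are already EXITING (assumed here to be shutting down).
     */
    static List<Member> pickTargets(List<Member> members, Member self,
                                    int maxTargets, Random random) {
        List<Member> candidates = new ArrayList<>();
        for (Member m : members) {
            if (!m.equals(self) && m.status() != Status.EXITING) {
                candidates.add(m);
            }
        }
        // Shuffle, then take a prefix, so no target is chosen twice.
        Collections.shuffle(candidates, random);
        int count = Math.min(maxTargets, candidates.size());
        return new ArrayList<>(candidates.subList(0, count));
    }

    public static void main(String[] args) {
        Member self = new Member("node1", Status.LEAVING);
        List<Member> members = List.of(
            self,
            new Member("node2", Status.EXITING),
            new Member("node3", Status.UP),
            new Member("node4", Status.UP));
        System.out.println(pickTargets(members, self, 3, new Random()));
    }
}
```

Taking a prefix of a shuffled candidate list keeps the choice random while guaranteeing that the same node is never selected more than once.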
* image-liveness-timeout must be less than the handshake-timeout,
otherwise the publication for the handshake will give up too early
while the previous image is still considered alive
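The ordering constraint between the two timeouts can be expressed as a simple check. The sketch below is hypothetical (the class and method names are invented for illustration, and the durations in `main` are example values, not necessarily Akka's defaults); it only shows the rule that the image liveness timeout has to be strictly shorter than the handshake timeout.

```java
import java.time.Duration;

class ArteryTimeoutCheck {
    /** Throws if imageLivenessTimeout is not strictly less than handshakeTimeout. */
    static void requireLivenessBelowHandshake(Duration imageLivenessTimeout,
                                              Duration handshakeTimeout) {
        if (imageLivenessTimeout.compareTo(handshakeTimeout) >= 0) {
            throw new IllegalArgumentException(
                "image-liveness-timeout (" + imageLivenessTimeout
                + ") must be less than handshake-timeout (" + handshakeTimeout + ")");
        }
    }

    public static void main(String[] args) {
        // Illustrative values only.
        requireLivenessBelowHandshake(Duration.ofSeconds(10), Duration.ofSeconds(20));
        System.out.println("timeouts are consistent");
    }
}
```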
* they can't be stopped immediately because we want to send
some final message and we reply to inbound messages with `Quarantined`
* and improve logging
* Setting to configure where the flight recorder puts its file
* Run ArteryMultiNodeSpecs with flight recorder enabled
* More cleanup in exit hook, wait for task runner to stop
* Enable flight recorder for the cluster multi node tests
* Enable flight recorder for multi node remoting tests
* Toggle always-dump flight recorder output when akka.remote.artery.always-dump-flight-recorder is set
* need to use a shared media driver to get the cpu usage
at a reasonable level
* also changed to SleepingIdleStrategy(1 ms) when cpu-level=1;
not needed for the test to pass, but it can be good to make level 1
more extreme
* Don't quarantine the other system when receiving the Quarantined message,
since that will result in cluster member removal and can result in
forming two separate clusters (cluster split).
* Instead, the downing strategy should act on ThisActorSystemQuarantinedEvent, e.g.
use it as a STONITH signal.
* track nodes by UniqueAddress in Cluster Singleton, #20942
* reply with HandOverDone from new incarnation, #20942
* confirm as terminated immediately when new incarnation joins, #20942,
instead of waiting for the failure detector to mark it as unreachable;
this will speed up removal when restarting a cluster node with the same hostname:port
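Why tracking by unique address helps when a node restarts on the same hostname:port can be shown with a stand-alone sketch. The `UniqueAddress` record below is a simplified, hypothetical stand-in (a host:port string plus a per-incarnation uid), not Akka's class, and the helper names are invented for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class IncarnationTracking {
    // Simplified stand-in: same hostPort, different uid means a new incarnation.
    record UniqueAddress(String hostPort, long uid) {}

    /** True if another incarnation with the same hostPort but a different uid is tracked. */
    static boolean hasOtherIncarnation(Set<UniqueAddress> tracked, UniqueAddress node) {
        for (UniqueAddress other : tracked) {
            if (other.hostPort().equals(node.hostPort()) && other.uid() != node.uid()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UniqueAddress old = new UniqueAddress("host1:2552", 1L);
        UniqueAddress restarted = new UniqueAddress("host1:2552", 2L);

        Set<UniqueAddress> tracked = new HashSet<>();
        tracked.add(old);
        tracked.add(restarted); // kept as a separate entry because the uid differs

        // The old incarnation can be dropped as soon as the new one is seen,
        // without touching the entry for the restarted node.
        if (hasOtherIncarnation(tracked, old)) {
            tracked.remove(old);
        }
        System.out.println(tracked);
    }
}
```

If the set were keyed by host:port alone, the old and the restarted node would collapse into one entry and removing the old incarnation would also remove the new one.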
* Provide shorter aliases for the ActorRefProviders #20649
* Use the new ActorRefProvider aliases throughout code and docs
* Cleaner alias replacement logic
* Automatic port selection when port 0 configured
* Combine remoting and artery SunnyWeatherSpec
* Default to port 0 for artery in MultiNodeSpec.nodeConfig
* minor fixes
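For readers unfamiliar with port 0: binding to it asks the operating system to assign a free port, and the actual port is read back after the bind. The sketch below shows that general behavior with a plain TCP `ServerSocket` from the Java standard library. It is only an illustration of the mechanism; how the remoting transports resolve port 0 internally is not shown here, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

class EphemeralPort {
    /** Binds to port 0 so the OS picks a free port, then returns the assigned port. */
    static int bindToFreePort() {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("OS-assigned port: " + bindToFreePort());
    }
}
```

Letting the OS choose avoids port clashes when several test JVMs run on the same machine, which is presumably why it is a convenient default for multi-node tests.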
* remove now superfluous buffer from MultipartUnmarshaller
* remove unused TokenSourceActor
* remove FIXME: add tests, see #16437
* removed unused param remoteAddress (comment: TODO: remove after #16168 is cleared)
* convert FIXME to TODO (#18709)
* re-enable tests in {Request|Response}RendererSpec due to fixed #15981
* remove logging workaround in StreamTestDefaultMailbox due to fixed #15947