* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase config obj with depends-on list
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
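A minimal configuration sketch of what the bullets above describe, assuming the akka.coordinated-shutdown setting names (phase objects with depends-on/timeout/recover, exit-jvm, and the JVM shutdown hook); exact names and defaults should be verified against reference.conf:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Sketch only: a custom phase wired into the shutdown DAG via depends-on,
// plus the exit-jvm and JVM-shutdown-hook toggles described above.
val config = ConfigFactory.parseString("""
  akka.coordinated-shutdown {
    run-by-jvm-shutdown-hook = on   # run the phases from a JVM shutdown hook
    exit-jvm = on                   # call System.exit in the last phase
    phases {
      my-phase {                    # hypothetical custom phase
        timeout = 10 s              # per-phase timeout
        recover = on                # continue with the next phase if a task fails
        depends-on = [cluster-exiting]
      }
    }
  }
""")

val system = ActorSystem("example", config.withFallback(ConfigFactory.load()))
```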
* send ExitingConfirmed to the leader before shutting down the Exiting node,
so that we don't have to wait for the failure detector to mark it as
unreachable before removing it
* the unreachability signal is still kept as a safeguard in case the
message is lost or the leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate the ActorSystem when the cluster is shut down (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
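A sketch of registering a shutdown task and triggering the phases programmatically; the phase constant and task name are illustrative, and newer Akka versions pass a Reason to run:

```scala
import scala.concurrent.Future
import akka.Done
import akka.actor.{ ActorSystem, CoordinatedShutdown }

val system = ActorSystem("example")

// Register a task in one of the predefined phases; the returned Future signals
// when the task is done and is bounded by that phase's timeout.
CoordinatedShutdown(system).addTask(
  CoordinatedShutdown.PhaseBeforeServiceUnbind, "my-cleanup") { () =>
  Future.successful(Done)
}

// Run all phases in depends-on (DAG) order; with the cluster extension this
// includes graceful leaving and sharding handoff in their respective phases.
val done: Future[Done] = CoordinatedShutdown(system).run()
```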
* problem when the leader order was sys2, sys1, sys3:
sys3 could not perform its duties and move the Leaving sys1 to
Exiting because it was observing sys1 as unreachable
* exclude Leaving members with exitingConfirmed from the convergence condition
* Verify that it actually fails with classic remoting
if vector clocks are not pruned
* Make it pass with Artery, but it is not verifying
the message sizes yet. We should implement that
with a custom RemoteInstrument, but that can be done
in a separate PR.
* Still pending with Artery because it fails on Jenkins
* barrier after sys shutdown
(cherry picked from commit d5edcbea35ca5b43ca4cfb3018602dd555402f42)
* speedup ActorCreationPerfSpec
* reduce iterations in ConsistencySpec
* tag SupervisorHierarchySpec as LongRunningTest
* various small speedups and tagging in actor-tests
* speedup expectNoMsg in stream-tests
* tag FramingSpec, and reduce iterations
* speedup QueueSourceSpec
* tag some stream-tests
* reduce iterations in persistence.PerformanceSpec
* reduce iterations in some cluster perf tests
* tag RemoteWatcherSpec
* tag InterpreterStressSpec
* remove LongRunning from ClusterConsistentHashingRouterSpec
* sys property to disable multi-jvm tests in test
* actually disable multi-node tests in validatePullRequest
* doc sbt flags in CONTRIBUTING
* image-liveness-timeout must be less than the handshake-timeout,
otherwise the publication for the handshake will give up too early
while the previous image is still considered alive
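A config sketch of the relation between the two Artery timeouts; the advanced.* setting names and the example values are assumptions to illustrate the constraint:

```scala
import com.typesafe.config.ConfigFactory

// image-liveness-timeout must stay below handshake-timeout so that a stale
// Aeron image from a previous incarnation is declared dead before the
// handshake publication gives up.
val arteryTimeouts = ConfigFactory.parseString("""
  akka.remote.artery.advanced {
    handshake-timeout = 20 s
    image-liveness-timeout = 10 s   # must be < handshake-timeout
  }
""")
```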
* Setting to configure where the flight recorder puts its file
* Run ArteryMultiNodeSpecs with flight recorder enabled
* More cleanup in exit hook, wait for task runner to stop
* Enable flight recorder for the cluster multi node tests
* Enable flight recorder for multi node remoting tests
* Toggle always-dump flight recorder output when akka.remote.artery.always-dump-flight-recorder is set
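A sketch of enabling the flight recorder and pointing it at a specific file; the flight-recorder setting names and the file path are assumptions, and always-dump-flight-recorder is the toggle named above:

```scala
import com.typesafe.config.ConfigFactory

// Sketch only: enable the Artery flight recorder and choose where its
// output file is written (an empty destination would fall back to a temp file).
val flightRecorder = ConfigFactory.parseString("""
  akka.remote.artery.advanced.flight-recorder {
    enabled = on
    destination = "target/artery-flight-recorder.afr"   # example path
  }
""")
```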
* need to use a shared media driver to get the CPU usage
down to a reasonable level
* also changed to SleepingIdleStrategy(1 ms) when cpu-level=1;
not needed for the test to pass, but it can be good to make level 1
more extreme
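A sketch of what a shared media driver means in configuration terms: turning off the embedded driver and pointing all systems at one externally started Aeron driver directory; setting names and the directory are assumptions:

```scala
import com.typesafe.config.ConfigFactory

// Sketch only: several ActorSystems share one external Aeron media driver
// instead of starting an embedded driver each, which lowers overall CPU usage.
val sharedMediaDriver = ConfigFactory.parseString("""
  akka.remote.artery.advanced {
    embedded-media-driver = off
    aeron-dir = "/dev/shm/aeron"   # directory of the externally started driver
    idle-cpu-level = 1             # lowest CPU usage, sleeping idle strategy
  }
""")
```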
* Don't quarantine the other system when receiving the Quarantined message,
since that would result in cluster member removal and could lead to
forming two separate clusters (cluster split).
* Instead, the downing strategy should act on ThisActorSystemQuarantinedEvent, e.g.
use it as a STONITH signal.
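A sketch of the STONITH reaction described above: instead of quarantining back, subscribe to ThisActorSystemQuarantinedEvent and down this node; the import path and the exact reaction are assumptions:

```scala
import akka.actor.{ Actor, ActorSystem, Props }
import akka.cluster.Cluster
import akka.remote.ThisActorSystemQuarantinedEvent // package assumed, verify per Akka version

// Sketch only: when the other side has quarantined this system, treat it as a
// STONITH signal and down this cluster member instead of quarantining back.
class QuarantinedListener extends Actor {
  context.system.eventStream.subscribe(self, classOf[ThisActorSystemQuarantinedEvent])

  def receive: Receive = {
    case _: ThisActorSystemQuarantinedEvent =>
      val cluster = Cluster(context.system)
      cluster.down(cluster.selfAddress) // shut this node out of the cluster
  }
}

val system = ActorSystem("example")
system.actorOf(Props[QuarantinedListener](), "quarantined-listener")
```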
* Provide shorter aliases for the ActorRefProviders #20649
* Use the new ActorRefProvider aliases throughout code and docs
* Cleaner alias replacement logic
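The aliases from #20649 in use; the equivalent fully qualified class name is shown for comparison:

```scala
import com.typesafe.config.ConfigFactory

// "local", "remote" and "cluster" are now accepted as shorthand for the
// corresponding ActorRefProvider class names.
val providerConfig = ConfigFactory.parseString("""
  akka.actor.provider = cluster
  # previously: akka.actor.provider = "akka.cluster.ClusterActorRefProvider"
""")
```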
* Automatic port selection when port 0 configured
* Combine remoting and artery SunnyWeatherSpec
* Default to port 0 for artery in MultiNodeSpec.nodeConfig
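A sketch of automatic port selection with Artery and reading back the port that was actually bound; getDefaultAddress is what I believe exposes the bound address, so treat that call as an assumption:

```scala
import akka.actor.{ ActorSystem, ExtendedActorSystem }
import com.typesafe.config.ConfigFactory

// Port 0 asks the transport to pick any free port (classic remoting uses
// akka.remote.netty.tcp.port = 0 in the same way).
val system = ActorSystem("example", ConfigFactory.parseString("""
  akka.actor.provider = remote
  akka.remote.artery {
    enabled = on
    canonical.hostname = "127.0.0.1"
    canonical.port = 0
  }
"""))

// The address (including the chosen port) this system is reachable on.
val boundAddress = system.asInstanceOf[ExtendedActorSystem].provider.getDefaultAddress
println(s"bound to $boundAddress")
```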
* The problem: an ACK that was targeted at an old incarnation
was sent to the new, restarted system with the same host:port,
resulting in issues noticed as
"Error encountered while processing system message acknowledgement buffer: [-1 {}] ack: ACK[0, {}]"
when restarting an actor system
* The reason:
1. The endpoint reader was about to send OutgoingAck to its parent, the
endpoint writer, targeted at the old system.
2. At the same time there was an incoming connection from the new system
that triggered TakeOver in the endpoint writer, i.e. replacing
the handle with the connection from the new system.
3. The OutgoingAck was received by the writer, which happily sent it
over the new handle, to the new system.
* The solution: Ignore OutgoingAck during the handoff (TakeOver) process.
* Failure detection heartbeating was not performed towards joining
nodes, since it was expected that they would become Up first.
* If a joining node is downed before it has been changed to Up, failure
detection will not be performed for that node. As a result, the
downed node would not be removed from membership, since the
unreachability signal is used as confirmation that the node has
actually stopped before removing it.
* the reported issue is fixed by the immediate leaderActions
(moving to Up) when joining the first node to itself
* the other changes are precautions just in case
* instead of using the transport failure detector
* add a new config property akka.remote.handshake-timeout, but
for netty.tcp and netty.ssl the existing netty.tcp.connection-timeout
setting will be used
* add test of the timeouts
* mima filter for internal ProtocolStateActor
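A config sketch of the new timeout; as noted above, the Netty transports keep using their existing connection-timeout, shown here for comparison (values are illustrative):

```scala
import com.typesafe.config.ConfigFactory

// Fail the handshake after a fixed timeout instead of relying on the
// transport failure detector.
val handshakeTimeout = ConfigFactory.parseString("""
  akka.remote.handshake-timeout = 15 s
  # for netty.tcp and netty.ssl the existing setting is used instead:
  # akka.remote.netty.tcp.connection-timeout = 15 s
""")
```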