* instead of using the transport failure detector
* add a new config property akka.remote.handshake-timeout, but
for netty.tcp and netty.ssl the existing netty.tcp.connection-timeout
setting will be used (see the config sketch after this list)
* add test of the timeouts
* mima filter for internal ProtocolStateActor
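
A hedged sketch of overriding these timeouts via Typesafe Config (the property
names come from the bullets above; the values are illustrative only, not defaults):

    import com.typesafe.config.ConfigFactory

    // Illustrative values only: the handshake timeout for the remoting handshake,
    // and the existing connection timeout that netty.tcp/netty.ssl reuse for it.
    val timeoutConfig = ConfigFactory.parseString("""
      akka.remote.handshake-timeout = 20 s
      akka.remote.netty.tcp.connection-timeout = 15 s
    """)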
* prevent Down and Exiting members from being used for joining
* delay shut down of Down member until the information is spread
to all reachable members, e.g. downing several nodes via one node
* akka.cluster.down-removal-margin setting:
the margin until shards or singletons that belonged to a
downed/removed partition are created in the surviving partition.
Used by singleton and sharding (see the config sketch after this list).
* remove the retry count parameters/settings for singleton in
favor of deriving those from the removal-margin
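
A hedged config sketch of the new setting (the value is illustrative only):

    import com.typesafe.config.ConfigFactory

    // How long surviving nodes wait before re-creating shards or singletons
    // that belonged to a downed/removed node; singleton retry behavior is
    // derived from this margin instead of separate retry-count settings.
    val removalMarginConfig = ConfigFactory.parseString("""
      akka.cluster.down-removal-margin = 20 s
    """)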
This improves the remote watching mechanism as follows: Watch requests
are intercepted by the RemoteWatcher and not sent on the wire,
except for watches from the RemoteWatcher itself.
RemoteWatcher is then in charge of forwarding DeathWatchNotification
messages to the watchers.
This reduces the number of watch messages to one per watchee, even if
there are several watchers on the same watchee (instead of n+1 before).
Reversed watch messages, and watches on refs with undefinedUid, are excluded from
interception by the RemoteWatcher and so are handled as before this commit.
In addition, the following changes are made:
- Keep watchers in a map watchee -> watchers for more efficient retrieval
(in a Scala MultiMap); see the bookkeeping sketch after this list
- Keep watchees in a map address -> watchees for more efficient retrieval
(in a Scala MultiMap)
- Use of InternalActorRef more thoroughly to avoid casts
- Rewatch uses a standard watch message, as the distinction is no longer needed
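
A minimal bookkeeping sketch (not the actual RemoteWatcher internals) showing
how a watchee -> watchers MultiMap keeps the on-the-wire Watch count at one
per watchee; the class and method names are illustrative:

    import akka.actor.ActorRef
    import scala.collection.mutable

    class WatchBookkeeping {
      // watchee -> set of local watchers
      private val watching: mutable.MultiMap[ActorRef, ActorRef] =
        new mutable.HashMap[ActorRef, mutable.Set[ActorRef]]
          with mutable.MultiMap[ActorRef, ActorRef]

      /** Returns true if this is the first watcher of the watchee,
       *  i.e. only then must a Watch message go on the wire. */
      def addWatch(watchee: ActorRef, watcher: ActorRef): Boolean = {
        val first = !watching.contains(watchee)
        watching.addBinding(watchee, watcher)
        first
      }

      /** Local watchers to forward a DeathWatchNotification to. */
      def watchersOf(watchee: ActorRef): Set[ActorRef] =
        watching.get(watchee).map(_.toSet).getOrElse(Set.empty)
    }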
+ enable parallel execution (see the sbt sketch after this list)
+ exclude perf tests (TODO mark more as such)
+ use the sbt-dependency-graph plugin
+ implement dependency tracking for testing only those projects
which could have been affected by a given PR
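
One way the parallel execution and perf-test exclusion could be expressed in
sbt (a sketch; the "performance" tag name and the plugin version are
assumptions, not taken from AkkaBuild.scala):

    // build.sbt sketch
    parallelExecution in Test := true

    // skip tests tagged as performance tests in regular runs
    testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-l", "performance")

    // project/plugins.sbt (version illustrative):
    // addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.7.4")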
* When a new uid is seen in a join attempt we can down the existing
member, and thereby the restarted node will be able to join
in a later retried join attempt without relying on auto-down.
* The problem was that the sys msg buffer was filled up during
the deploy phase, which triggered quarantine too early, and therefore
the "hello" reply was lost. The "hello" ping-pong was not good
enough for deploying one-by-one.
(cherry picked from commit f729afe1fa5401e562655e5a0aaab3f9789e4df6)
Conflicts:
akka-cluster/src/multi-jvm/scala/akka/cluster/SurviveNetworkInstabilitySpec.scala
* Otherwise the leader might stall (cannot remove downed nodes)
if many nodes are shut down at the same time and nobody in the
remaining cluster is monitoring some of the shut-down nodes.
(cherry picked from commit 1354524c4fde6f40499833bdd4c0edd479e6f906)
Conflicts:
akka-cluster/src/main/scala/akka/cluster/ClusterHeartbeat.scala
project/AkkaBuild.scala
This is an API breaking change if someone implemented their own Routers.
The change is required because the router must know whether the local routees
should be started or not, so it has to check the roles of the cluster
member (the local one). We could delay this decision of starting local
routees, but that would allow messages to be dead-lettered (bad).
A config sketch follows below.
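
A hedged illustration using the standard cluster router deployment settings
(Akka 2.3-style names; path and values are examples only): local routees are
only started when the local member has the configured role, which is why the
router must inspect the local member's roles up front.

    import com.typesafe.config.ConfigFactory

    val routerDeployment = ConfigFactory.parseString("""
      akka.actor.deployment {
        /service/workerRouter {
          router = round-robin-pool
          cluster {
            enabled = on
            max-nr-of-instances-per-node = 3
            allow-local-routees = on
            use-role = "backend"
          }
        }
      }
    """)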
* deprecates awaitTermination, shutdown and isTerminated
* introduces a terminate-method that returns a Future[Unit]
* introduces a whenTerminated-method that returns a Future[Unit] (usage sketch after this list)
* simplifies the implementation by removing blocking constructs
* adds tests for terminate() and whenTerminated
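
A minimal usage sketch of the replacement API described above:

    import scala.concurrent.Await
    import scala.concurrent.duration._
    import akka.actor.ActorSystem

    val system = ActorSystem("example")
    // ... run the application ...
    system.terminate()                              // replaces shutdown()
    Await.ready(system.whenTerminated, 10.seconds)  // replaces awaitTermination(...)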
* The problem was that the unreachability observed by the second node
was leaking from the previous test step, and when the blackhole was added
it could not heal, which meant that the leader was not able to remove
the downed second node because some other nodes were still marked as
unreachable.
* The first node was not included in the awaitAllReachable check
in the previous step, and the order of awaitAllReachable and
awaitMembersUp was wrong.
* Included the awaitAllReachable check in assertCanTalk.
* Changed to a two-way blackhole and used a barrier instead of a scheduled
event to trigger the exceptions while the blackhole was in place
* We should investigate if unreachable observations from a downed node
can be excluded in the convergence check. Created a separate ticket for
that: 3875.
* It did not use the toString (including the full address of the destination) of the
node entries; instead it used the hashCode, which always included the self
address
* This was a regression in 2.3, it is correct in 2.2.3
* The Identify message didn't get through to the master, which
was stopping at the same time, and it didn't get redirected to
deadletters, i.e. the "termination race"
* because it is not referentially transparent; normally we reserve parens for
side-effecting code but given how people thoughtlessly close over it we revised
that decision for sender (see the sketch after this list)
* caller can still omit parens
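
A small sketch of why sender needs parens: the value it returns depends on the
message currently being processed, so closing over it inside a Future can
capture the wrong ref. Capture it explicitly first (actor name is illustrative):

    import akka.actor.Actor
    import scala.concurrent.Future

    class EchoActor extends Actor {
      import context.dispatcher
      def receive = {
        case msg: String =>
          val replyTo = sender() // capture the current sender before going async
          Future { msg.toUpperCase }.foreach(result => replyTo ! result)
      }
    }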
- removed retry-window and related settings
- removed gate-invalid-addresses-for
- gate is now mandatory
- remoting has a dedicated dispatcher by default (see the config sketch after this list)
- updated tests to work with changed timings
- added doc section for association lifecycle
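
A hedged config sketch touching the settings mentioned above (values are
illustrative; the dedicated dispatcher is already the default):

    import com.typesafe.config.ConfigFactory

    val remotingConfig = ConfigFactory.parseString("""
      akka.remote {
        # the gate is mandatory; this only tunes how long it stays closed
        retry-gate-closed-for = 5 s
        # dedicated remoting dispatcher, used by default
        use-dispatcher = "akka.remote.default-remote-dispatcher"
      }
    """)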
* The previous one-way heartbeat was elegant, but complicated to
understand and did not give much extra value compared to this approach.
* The previous one-way heartbeat had some kind of bug when joining
several (10-20) nodes at approximately the same time (but not exactly
the same time), with a false failure detection triggered by the extra heartbeat,
which would not heal.
* This ping-pong approach will increase network traffic slightly, but heartbeat
messages are small and each node is limited to monitoring (by default) 5 peers;
a minimal sketch follows below.
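
A minimal sketch of the ping-pong idea (not the actual ClusterHeartbeat
implementation; names are illustrative): liveness is recorded only when the
reply comes back, instead of relying on one-way heartbeats.

    import akka.actor.{ Actor, ActorRef }
    import scala.concurrent.duration._

    case object Heartbeat
    case object HeartbeatRsp

    class HeartbeatSender(peers: Set[ActorRef]) extends Actor {
      import context.dispatcher
      private val tick = context.system.scheduler.schedule(1.second, 1.second, self, "tick")

      def receive = {
        case "tick"       => peers.foreach(_ ! Heartbeat)
        case HeartbeatRsp => // report heartbeat arrival to the failure detector here
      }

      override def postStop(): Unit = tick.cancel()
    }

    class HeartbeatReceiver extends Actor {
      def receive = {
        case Heartbeat => sender() ! HeartbeatRsp
      }
    }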