* The problem was that the unreachability observed by the second node
leaked from the previous test step. Once the blackhole was added it
could not heal, so the leader was not able to remove the downed second
node because some other nodes were still marked as unreachable.
* The first node was not included in the awaitAllReachable check
in the previous step, and the order of awaitAllReachable and
awaitMembersUp was wrong.
* Included the awaitAllReachable check in assertCanTalk.
* Changed to a two-way blackhole, and to using a barrier instead of a
scheduled event to trigger the exceptions when the blackhole was in place.
* We should investigate whether unreachable observations from a downed
node can be excluded from the convergence check. Created a separate
ticket for that: 3875.
* It did not use the toString (including the full address of the destination)
of the node entries; instead it used the hashCode, which always included the
self address
* This was a regression in 2.3; it is correct in 2.2.3
* The Identify message didn't get through to the master, which
was stopping at the same time, and it didn't get redirected to
deadletters, i.e. the "termination race"
* because it is not referentially transparent; normally we reserve parens for
side-effecting code, but given how people thoughtlessly close over it we revised
that decision for sender
* caller can still omit parens
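A minimal sketch of why closing over a method that is not referentially
transparent is risky. `FakeActor` and its members are hypothetical stand-ins
for illustration, not Akka's `Actor` API; the point is only that `sender()`
reads mutable state at call time.

```scala
object SenderSketch {
  // Hypothetical stand-in for an actor; not Akka's API.
  final class FakeActor {
    private var currentSender: String = "nobody"

    // The parens signal that the result depends on mutable state.
    def sender(): String = currentSender

    // Simulates handling a message from `from` and returns two callbacks
    // that would run later (for example on another thread).
    def receive(from: String): (() => String, () => String) = {
      currentSender = from
      val unsafe = () => sender() // closes over the method: evaluated late
      val captured = sender()     // evaluated now, while the state is valid
      val safe = () => captured   // closes over an immutable value
      (unsafe, safe)
    }
  }

  def demo(): (String, String) = {
    val actor = new FakeActor
    val (unsafe, safe) = actor.receive("alice")
    actor.receive("bob") // the next message changes the state
    (unsafe(), safe())
  }
}
```

In this sketch `SenderSketch.demo()` yields `("bob", "alice")`: the callback
that closed over the method sees the later sender, while the one that captured
the value first keeps the original.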
- removed retry-window and related settings
- removed gate-invalid-addresses-for
- gate is now mandatory
- remoting has a dedicated dispatcher by default
- updated tests to work with changed timings
- added doc section for association lifecycle
* The previous one-way heartbeat was elegant, but complicated to
understand, and did not give much extra value compared to this approach.
* The previous one-way heartbeat had some kind of bug when joining
several (10-20) nodes at approximately the same time (but not exactly
the same time), with a false failure detection triggered by the extra
heartbeat, which would not heal.
* This ping-pong approach will increase network traffic slightly, but heartbeat
messages are small and each node is limited to monitor (default) 5 peers.
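A rough sketch of the ping-pong idea with a bounded number of monitored peers.
The message types and the `receivers` helper are hypothetical, and choosing
peers as the next nodes in a sorted ring is an assumption made for
illustration; only the default limit of 5 comes from the text above.

```scala
object HeartbeatSketch {
  // Hypothetical message types for a ping-pong heartbeat.
  final case class Heartbeat(from: String)
  final case class HeartbeatRsp(from: String)

  // Picks up to `limit` peers that follow `self` in a sorted ring.
  // The ring ordering is an assumption of this sketch.
  def receivers(self: String, nodes: Set[String], limit: Int = 5): Vector[String] = {
    val ring = nodes.toVector.sorted
    val idx = ring.indexOf(self)
    if (idx < 0) Vector.empty
    else (ring.drop(idx + 1) ++ ring.take(idx)).take(limit)
  }

  // Every received Heartbeat is answered with exactly one HeartbeatRsp
  // addressed back to the original sender.
  def reply(self: String, hb: Heartbeat): (String, HeartbeatRsp) =
    (hb.from, HeartbeatRsp(self))
}
```

With this shape, the extra traffic per node and interval is bounded by the
limit: at most 5 heartbeats out and 5 replies back, regardless of cluster size.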
* Separate routing logic, to be usable stand alone, e.g. in actors
* Simplify RouterConfig, only a factory
* Move reading of config from Deployer to the RouterConfig
* Distinction between Pool and Group router types
* Remove usage of actorFor, use ActorSelection
* Management messages to add and remove routees
* Simplify the internals of RoutedActorCell & co
* Move resize specific code to separate RoutedActorCell subclass
* Change resizer api to only return capacity change
* Resizer only allowed together with Pool
* Re-implement all routers, and keep old api during deprecation phase
* Replace ClusterRouterConfig, deprecation
* Rewrite documentation
* Migration guide
* Also includes related ticket:
+act #3087 Create nicer Props factories for RouterConfig
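To illustrate two of the points in the list above (routing logic that is usable
stand alone, and a resizer that only returns the capacity change), here is a
simplified sketch. The trait, class, and method signatures are hypothetical and
do not reproduce Akka's actual API; routees are plain strings here.

```scala
object RoutingSketch {
  // Hypothetical, simplified routing logic; not Akka's actual signature.
  trait RoutingLogic {
    def select(message: Any, routees: IndexedSeq[String]): Option[String]
  }

  // Round-robin logic that can be used on its own, e.g. from inside an actor.
  final class RoundRobin extends RoutingLogic {
    private var next = 0
    def select(message: Any, routees: IndexedSeq[String]): Option[String] =
      if (routees.isEmpty) None
      else {
        val chosen = routees(next % routees.size)
        next = (next + 1) % routees.size
        Some(chosen)
      }
  }

  // A resizer that only reports the capacity change: positive to grow,
  // negative to shrink, zero to leave the pool as it is.
  def resize(current: Int, lower: Int, upper: Int): Int =
    if (current < lower) lower - current
    else if (current > upper) upper - current
    else 0
}
```

Returning only the delta keeps the resizer free of any knowledge about how
routees are created or stopped; the caller applies the change to the pool.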
* Revert the change introduced in
https://github.com/akka/akka/pull/1738/files
* The cleanup/improvements aside of the actual
feature is not reverted by this patch
* Clarify the documentation
* Replace (deprecate) akka.cluster.auto-down config setting with
akka.cluster.auto-down-unreachable-after
* AutoDown actor that keeps track of unreachable members
and performs down from the leader node when they have been
unreachable for the specified duration
* Migration guide
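The bookkeeping described above could look roughly like the following sketch.
`Tracker` and its methods are hypothetical names, time is passed in explicitly
to keep the example deterministic, and the real AutoDown actor may differ in
structure.

```scala
object AutoDownSketch {
  import scala.concurrent.duration.FiniteDuration

  // Hypothetical tracker: address -> time (millis) it was first seen unreachable.
  final case class Tracker(after: FiniteDuration, since: Map[String, Long] = Map.empty) {
    // Record the first time a member is observed as unreachable.
    def unreachable(address: String, nowMillis: Long): Tracker =
      if (since.contains(address)) this
      else copy(since = since.updated(address, nowMillis))

    // Forget a member that became reachable again.
    def reachable(address: String): Tracker = copy(since = since - address)

    // Members to down: only the leader acts, and only for members that have
    // been unreachable for at least `after`.
    def toDown(nowMillis: Long, isLeader: Boolean): Set[String] =
      if (!isLeader) Set.empty
      else since.collect { case (addr, t) if nowMillis - t >= after.toMillis => addr }.toSet
  }
}
```

The duration would come from akka.cluster.auto-down-unreachable-after; a value
such as 10s is used here purely as an illustration.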
* It was a regression introduced in dc9fe4f
* Two problems:
1) Gossip merge could pop back removed member (was previously
covered by the filter of unreachable)
2) Reachability merge didn't handle all cases for removed member,
i.e. when node not in allowed set
* Replace unreachable Set with Reachability table
* Unreachable members stay in member Set
* Downing a live member previously moved it to the unreachable Set,
and it was then removed from there by the leader. That will not
work when flipping back to reachable, so a Down member must
be detected as unreachable before being removed. Similar
to Exiting. A member shuts itself down if it sees itself as
Down.
* Flip back to reachable when failure detector monitors it as
available again
* ReachableMember event
* Can't ignore gossip from aggregated unreachable (see SurviveNetworkInstabilitySpec)
* Make use of ReachableMember event in cluster router
* End heartbeat when acknowledged, EndHeartbeatAck
* Remove nr-of-end-heartbeats from conf
* Full reachability info in JMX cluster status
* Don't use interval after unreachable for AccrualFailureDetector history
* Add QuarantinedEvent to remoting, used for Reachability.Terminated
* Prune reachability table when all reachable
* Update documentation
* Performance testing and optimizations
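A simplified sketch of a reachability table, covering three of the points
above: per-observer records instead of a single unreachable Set, flipping back
to reachable, and pruning when everything is reachable. The `Table` type and
its methods are hypothetical and much smaller than the real Reachability class.

```scala
object ReachabilitySketch {
  sealed trait Status
  case object Reachable extends Status
  case object Unreachable extends Status

  // Hypothetical table: (observer, subject) -> status.
  final case class Table(records: Map[(String, String), Status] = Map.empty) {
    def unreachable(observer: String, subject: String): Table =
      copy(records = records.updated((observer, subject), Unreachable))

    // Flip back to reachable; prune the whole table once nothing is unreachable.
    def reachable(observer: String, subject: String): Table = {
      val updated = records.updated((observer, subject), Reachable)
      if (updated.values.forall(_ == Reachable)) Table()
      else Table(updated)
    }

    // A subject counts as reachable only if no observer marks it unreachable.
    def isReachable(subject: String): Boolean =
      !records.exists { case ((_, s), st) => s == subject && st == Unreachable }
  }
}
```

Because members stay in the member Set and only the table changes, a subject
becomes reachable again as soon as the last observer flips its record back.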
* Removed leader commands for Shutdown and Exit
* Member shuts itself down when it sees itself as Exiting
* Singleton cluster with status Exiting will shut itself down,
in case the Exiting gossip never arrives
* Exiting member is not part of the convergence check
* Exiting member is removed by leader (on convergence) when the
exiting member is in the unreachable set, i.e. successfully shut down
* Reverted the change made for #3266, i.e. Exiting is
detected as unreachable again.
* Adjust ClusterSingletonManager to new Exiting behaviour
* Fix bug in HeartbeatSender, which caused it to continue to
send heartbeats to removed nodes, instead of rebalancing
* Refactoring of leaderActions method
* Leaving section in docs
* Assign internal upNumber when member is moved to Up
* Public API Member.isOlder
* Change cluster singleton to use oldest member instead of leader
* Update samples and docs
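A small sketch of how an age ordering based on upNumber can pick the oldest
member. The `Member` case class here is a hypothetical simplification, and the
comparison (lower upNumber means older) is an assumption about how the internal
number is used.

```scala
object OldestSketch {
  // Hypothetical simplified member; upNumber is assigned when it is moved to Up.
  final case class Member(address: String, upNumber: Int) {
    // Assumption: a lower upNumber means the member became Up earlier.
    def isOlder(other: Member): Boolean = upNumber < other.upNumber
  }

  // The singleton would run on the oldest member rather than on the leader.
  def oldest(members: Set[Member]): Option[Member] =
    members.reduceOption((a, b) => if (a.isOlder(b)) a else b)
}
```

Unlike leadership, which can move between nodes as membership changes, the
oldest member in this sketch only changes when that member itself goes away.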
* When the seen state is the same, the gossip chat is initiated with a
GossipStatus message containing the vclock only
* Remove conversation flag in GossipEnvelope
* Ordinary tell instead of actorSelection when replying
* UnreachableNodeJoinsAgain failed because of gated connection
* Removed default test value of retry-gate-closed-for, instead
default from reference.conf is used, i.e. 0s
* deadLetters logging love