* they can't be stopped immediately because we want to send
some final message and we reply to inbound messages with `Quarantined`
* and improve logging
* Failure detection heartbeating was not performed to joining
nodes, since it was expected that they will become Up first.
* If a joining node is downed before it is changed to Up failure
detection will not be performed for that node. That resulted in
the downed node will not be removed from membership, since the
unreachability signal is used as confirmation that the node is
actually stopped before removing it.
When using a dispatcher (default or separate cluster dispatcher)
with less than 5 threads the Cluster extension initialization
could deadlock.
It was reproducable by adding a sleep before the Await of GetClusterCoreRef
in the Cluster extension constructor. The reason was that other cluster actors were
started too early and they also tried to get the Cluster extension and thereby blocking
dispatcher threads.
Note that the Cluster extension is started via ClusterActorRefProvider before
ActorSystem.apply returns.
The improvement is to start the cluster child actors lazily when the
GetClusterCoreRef is received.
* sysmsg.Terminate, sysmsg.DeathWatchNotification, io.Tcp.Closed
were needed to silence normal usage of http client/server
* other things based on jenkins logs, but not a complete audit
(cherry picked from commit 270e3b2f49af3c34fd5ea4c3bcfd8257402b5cbe)
* Otherwise the leader might stall (cannot remove downed nodes)
if many nodes are shutdown at the same time and nobody in the
remaining cluster is monitoring some of the shutdown nodes.
(cherry picked from commit 1354524c4fde6f40499833bdd4c0edd479e6f906)
Conflicts:
akka-cluster/src/main/scala/akka/cluster/ClusterHeartbeat.scala
project/AkkaBuild.scala
* Replace stash with internal bufferi, j.u.LinkedList
* Replace FSM with become
* Adaptive backoff, important to backoff, but not for too long,
depends on environment and use case
* Prioritize heartbeat messages from remote watcher and cluster
failure detector
* Use payload messages as heartbeats for transport failure detector,
change transport failure detector to be based on absolute timeout,
see ticket #13989 and #13742
* Log remote disassociate from transport failure detector,
see ticket #13985
* Add benchmark sample in akka-sample-remote-scala
* because it is not referentially transparent; normally we reserved parens for
side-effecting code but given how people thoughtlessly close over it we revised
that that decision for sender
* caller can still omit parens
* The previous one-way hearbeat was elegant, but comlicated to
understand and without giving much extra value compared to this approach.
* The previous one-way heartbeat have some kind of bug when joining
several (10-20) nodes at approximately the same time (but not exactly
the same time) with a false failure detection triggered by the extra heartbeat,
which would not heal.
* This ping-pong approach will increase network traffic slightly, but heartbeat
messages are small and each node is limited to monitor (default) 5 peers.
* Replace unreachable Set with Reachability table
* Unreachable members stay in member Set
* Downing a live member was moved it to the unreachable Set,
and then removed from there by the leader. That will not
work when flipping back to reachable, so a Down member must
be detected as unreachable before beeing removed. Similar
to Exiting. Member shuts down itself if it sees itself as
Down.
* Flip back to reachable when failure detector monitors it as
available again
* ReachableMember event
* Can't ignore gossip from aggregated unreachable (see SurviveNetworkInstabilitySpec)
* Make use of ReachableMember event in cluster router
* End heartbeat when acknowledged, EndHeartbeatAck
* Remove nr-of-end-heartbeats from conf
* Full reachability info in JMX cluster status
* Don't use interval after unreachable for AccrualFailureDetector history
* Add QuarantinedEvent to remoting, used for Reachability.Terminated
* Prune reachability table when all reachable
* Update documentation
* Performance testing and optimizations
* Removed leader commands for Shutdown and Exit
* Member shutdown itself when it sees itself as Exiting
* Singleton cluster with status Exiting will shutdown itself,
in case the Exiting gossip never arrives
* Exiting member not part convergence check
* Exiting member is removed by leader (on convergence) when the
exiting member is in the unreachable set, i.e. sucessfully shutdown
* Reverted the change made for #3266, i.e. Exiting is
detected as unreachable again.
* Adjust ClusterSingletonManager to new Exiting behaviour
* Fix bug in HeartbeatSender, which caused it to continue to
send heartbeats to removed nodes, instead of rebalancing
* Refactoring of leaderActions method
* Leaving section in docs
* Deprecate all actorFor methods
* resolveActorRef in provider
* Identify auto receive message
* Support ActorPath in actorSelection
* Support remote actor selections
* Additional tests of actor selection
* Update tests (keep most actorFor tests)
* Update samples to use actorSelection
* Updates to documentation
* Migration guide, including motivation
* Previous work-around was introduced because Netty blocks when sending
to broken connections. This is supposed to be solved by the non-blocking
new remoting.
* Removed HeartbeatSender and CoreSender in cluster
* Added tests to verify that broken connections don't disturb live connection
* When a def starts with if and is not a oneliner the if
should be on a new line.
* The reason is that it might be easy to miss the if when
reading the code.
* Subscribe to InstantMemberEvent and start heartbeating when
InstantMemberUp. Same for metrics.
* HeartbeatNodeRing data structure for bidirectional mapping of
heartbeat sender and receiver. Not using ConsistentHash anymore.
Node addresses are hashed to ensure that neighbors are spread out.
* HeartbeatRequest when receiver detects that it has not received
expected heartbeats.
* New test InitialHeartbeatSpec that simulates the problem
* Add/remove some related conf properties
* Add some more logging to be able to diagnose eventual problems
* Explicit config of nr-of-end-heartbeats
* akka.cluster.StressSpec
* Configurable number of nodes and duration for each step
* Report metrics and phi periodically to see progress
* Configurable payload size
* Test of various join and remove scenarios
* Test of watch
* Exercise supervision
* Report cluster stats
* Test with many actors in tree structure
Apart from the test this commit also solves some issues:
* Avoid adding back members when downed in ClusterHeartbeatSender
* Avoid duplicate close of ClusterReadView
* Add back the publish of AddressTerminated when MemberDowned/Removed
it was lost in merge of "publish on convergence", see #2779