* It was a regression introduced in dc9fe4f
* Two problems:
1) Gossip merge could pop back removed member (was previously
covered by the filter of unreachable)
2) Reachability merge didn't handle all cases for removed member,
i.e. when node not in allowed set
* Replace unreachable Set with Reachability table
* Unreachable members stay in member Set
* Downing a live member was moved it to the unreachable Set,
and then removed from there by the leader. That will not
work when flipping back to reachable, so a Down member must
be detected as unreachable before beeing removed. Similar
to Exiting. Member shuts down itself if it sees itself as
Down.
* Flip back to reachable when failure detector monitors it as
available again
* ReachableMember event
* Can't ignore gossip from aggregated unreachable (see SurviveNetworkInstabilitySpec)
* Make use of ReachableMember event in cluster router
* End heartbeat when acknowledged, EndHeartbeatAck
* Remove nr-of-end-heartbeats from conf
* Full reachability info in JMX cluster status
* Don't use interval after unreachable for AccrualFailureDetector history
* Add QuarantinedEvent to remoting, used for Reachability.Terminated
* Prune reachability table when all reachable
* Update documentation
* Performance testing and optimizations
* When seen same the gossip chat is initated with GossipStatus
message containing the vclock only
* Remove conversation flag in GossipEnvelope
* Ordinary tell instead of actorSelection when replying
* UnreachableNodeJoinsAgain failed because of gated connection
* Removed default test value of retry-gate-closed-for, instead
default from reference.conf is used, i.e. 0s
* deadLetters logging love
* Rename config akka.event-handlers to akka.loggers
* Rename config akka.event-handler-startup-timeout to
akka.logger-startup-timeout
* Rename JulEventHandler to JavaLogger
* Rename Slf4jEventHandler to Slf4jLogger
* Change all places in tests and docs
* Deprecation, old still works, but with warnings
* Migration guide
* Test for the deprecated event-handler config
* Failure detector was previously copied with refactoring to
akka-remote and this refactoring makes use of that and removes
the failure detector in akka-cluster
* Adjustments to reference.conf
* Refactoring of FailureDetectorPuppet
* When a def starts with if and is not a oneliner the if
should be on a new line.
* The reason is that it might be easy to miss the if when
reading the code.
* Subscribe to InstantMemberEvent and start heartbeating when
InstantMemberUp. Same for metrics.
* HeartbeatNodeRing data structure for bidirectional mapping of
heartbeat sender and receiver. Not using ConsistentHash anymore.
Node addresses are hashed to ensure that neighbors are spread out.
* HeartbeatRequest when receiver detects that it has not received
expected heartbeats.
* New test InitialHeartbeatSpec that simulates the problem
* Add/remove some related conf properties
* Add some more logging to be able to diagnose eventual problems
* Explicit config of nr-of-end-heartbeats
* akka.cluster.StressSpec
* Configurable number of nodes and duration for each step
* Report metrics and phi periodically to see progress
* Configurable payload size
* Test of various join and remove scenarios
* Test of watch
* Exercise supervision
* Report cluster stats
* Test with many actors in tree structure
Apart from the test this commit also solves some issues:
* Avoid adding back members when downed in ClusterHeartbeatSender
* Avoid duplicate close of ClusterReadView
* Add back the publish of AddressTerminated when MemberDowned/Removed
it was lost in merge of "publish on convergence", see #2779
It's a glitch in how ClusterReadView (used in tests) is updated.
First we do awaitUpConvergence, which checks readView.members and
readView.convergence. Then we do assertLeader, which checks
readView.leader. The problem is that readView.leader might not
have been updated yet (if there already was convergence).
Solution is to await the expected leader in awaitUpConvergence,
so that readView is in a consistent state after awaitUpConvergence.
We will change the semantics in ticket #2692, but this change will
be needed and should work with that as well.
* MetricsSelector, calculate capacity, weights and allocate weighted
routee refs
* ClusterLoadBalancingRouterSpec
* Optional heap max
* Constants for the metric fields
* Refactoring of Metric and decay
* Rewrite of DataStreamSpec
* Correction of EWMA and removal of BigInt, BigDecimal
* Separation of MetricsCollector into trait and two classes,
SigarMetricsCollector and JmxMetricsCollector
* This will reduce cost when sigar is not installed, such as
avoiding throwing and catching exc for every call
* Improved error handling for loading sigar
* Made MetricsCollector implementation configurable
* Tested with sigar
* Major refactoring to remove the need to use special
Cluster instance for testing. Use default Cluster
extension instead. Most of it is trivial changes.
* Used failure-detector.implementation-class from config
to swap to Puppet
* Removed FailureDetectorStrategy, since it doesn't add any value
* Added Cluster.joinSeedNodes to be able to test seedNodes when Addresses
are unknown before startup time.
* Removed ClusterEnvironment that was passed around among the actors,
instead they use the ordinary Cluster extension.
* Overall much cleaner design
* We will write more tests that rely on real Cluster(system) extension,
such as ClusterRoundRobinRoutedActorSpec
* When not using FailureDetectorStrategy or overriding seed nodes
MultiNodeClusterSpec will use the real Cluster(system) extension
instead of a new Cluster instance with additional test facilities
* Gossip is not exposed in user api
* Better and more events
* Snapshot event sent to new subscriber
* Updated tests
* Periodic publish only for internal stats