* A problem may occur when a member joins again with the same hostname:port
  after having been downed.
* Reproduced with StressSpec exerciseJoinRemove using a fixed port, where the
  node joins and shuts down several times.
* The real solution will be covered by ticket #2788, which adds a uid to the
  member identifier, but as a first step we need to support this scenario
  with the current design.
* Use a unique node identifier for the vector clock to avoid mixing up the
  old and the new member instance (see the sketch below).
* Support transition from Down to Joining in Gossip merge
* Don't gossip to unknown or unreachable members.
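* A minimal sketch of the idea behind the unique vector clock node identifier;
  the object name and the timestamp-based suffix are assumptions for
  illustration, not the actual implementation:

      import java.security.MessageDigest

      // Derive a vector clock node id from the member address plus a
      // per-incarnation suffix, so a member that rejoins with the same
      // hostname:port gets a fresh vector clock node instead of reusing
      // the old one.
      object VectorClockNodeId {
        def apply(address: String): String = {
          val unique = s"$address-${System.currentTimeMillis()}"
          MessageDigest.getInstance("MD5")
            .digest(unique.getBytes("UTF-8"))
            .map("%02x".format(_)).mkString
        }
      }
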
* ClusterCoreDaemon and ClusterDomainEventPublisher can't be restarted
because the state would be obsolete.
* Add an extra supervisor level for ClusterCoreDaemon and
  ClusterDomainEventPublisher, which shuts down the member on failure
  in the children (see the sketch below).
* Publish the final removed state on postStop in
  ClusterDomainEventPublisher. This also simplifies the removal
  process.
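* A minimal sketch of such a supervisor; the class name and constructor
  parameters are assumptions for illustration, not the actual akka-cluster
  source:

      import akka.actor.{ Actor, ActorLogging, OneForOneStrategy, Props }
      import akka.actor.SupervisorStrategy.Stop

      // Never restart the stateful children: on any failure, stop the child
      // and shut the whole member down instead of restarting with obsolete state.
      class ClusterCoreSupervisor(coreProps: Props, publisherProps: Props,
                                  shutdownMember: () => Unit)
        extends Actor with ActorLogging {

        val publisher = context.actorOf(publisherProps, "publisher")
        val core = context.actorOf(coreProps, "daemon")

        override val supervisorStrategy = OneForOneStrategy() {
          case e: Throwable =>
            log.error(e, "Cluster node crashed, shutting down member")
            shutdownMember()
            Stop
        }

        def receive = {
          case msg => core forward msg
        }
      }
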
* The previous work-around was introduced because Netty blocks when sending
  to broken connections. This is supposed to be solved by the new
  non-blocking remoting.
* Removed HeartbeatSender and CoreSender in cluster
* Added tests to verify that broken connections don't disturb live connections
* When a def starts with an if and is not a one-liner, the if
  should be on a new line (see the example below).
* The reason is that it is easy to miss the if when
  reading the code.
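* Illustration of the convention, using hypothetical defs:

      // One-liner: the if may stay on the same line as the def
      def status(alive: Boolean): String = if (alive) "up" else "down"

      // Not a one-liner: put the if on a new line so it is not missed
      // when reading the code
      def describe(alive: Boolean, since: Long): String =
        if (alive) s"up since $since"
        else s"down since $since"
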
* Subscribe to InstantMemberEvent and start heartbeating when
InstantMemberUp. Same for metrics.
* HeartbeatNodeRing data structure for bidirectional mapping of
  heartbeat senders and receivers. Not using ConsistentHash anymore.
  Node addresses are hashed to ensure that neighbors are spread out
  (see the sketch below).
* HeartbeatRequest is sent when the receiver detects that it has not received
  the expected heartbeats.
* New test InitialHeartbeatSpec that simulates the problem
* Add/remove some related conf properties
* Add some more logging to be able to diagnose possible problems
* Explicit config of nr-of-end-heartbeats
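* A minimal sketch of the HeartbeatNodeRing idea; the types are simplified
  (plain String addresses) and the names are assumptions, not the actual
  akka-cluster source:

      // Order the nodes by a hash of their address so that ring neighbors are
      // spread out, and let each node send heartbeats to the next few nodes.
      final case class HeartbeatNodeRing(nodes: Set[String], monitoredByNrOfMembers: Int) {

        private val ring: Vector[String] =
          nodes.toVector.sortBy(n => (n.hashCode, n)) // hash first, address as tie-breaker

        /** Nodes that `sender` sends heartbeats to. */
        def receivers(sender: String): Set[String] = {
          val i = ring.indexOf(sender)
          if (i == -1) Set.empty
          else (1 to math.min(monitoredByNrOfMembers, ring.size - 1))
            .map(step => ring((i + step) % ring.size)).toSet
        }

        /** The other direction of the mapping: nodes that send heartbeats to `receiver`. */
        def senders(receiver: String): Set[String] =
          ring.filter(n => n != receiver && receivers(n).contains(receiver)).toSet
      }
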
* The failure in JoinTwoClustersSpec was due to missing publishing
  of cluster events when the current state is cleared on joining
* This fix is a step in the right direction, but joining clusters like this
  will need some design thought; created ticket #2873 for that
* Renamed isRunning to isTerminated (with negation of course)
* Removed Running from JMX API, since the mbean is deregistered anyway
* Clean up isAvailable, isUnavailable
* Misc minor changes
* Previously heartbeat messages were sent to all other members, i.e.
  each member was monitored by all other members in the cluster.
* This was the number one known scalability bottleneck, due to the
  number of interconnections.
* Limit sending of heartbeats to a few (5) members. Select and
  re-balance with a consistent hashing algorithm when members
  are added or removed.
* Send a few EndHeartbeat messages when ending the sending of Heartbeat
  messages (see the sketch below).
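* A minimal sketch of the ending idea; the state class and field names are
  assumptions for illustration, not the actual akka-cluster source:

      // When a node is removed from the set of heartbeat receivers, keep it in
      // an "ending" map and send it a configured number of EndHeartbeat
      // messages before dropping it completely.
      final case class HeartbeatSenderState(current: Set[String],
                                            ending: Map[String, Int],
                                            nrOfEndHeartbeats: Int) {

        /** Re-balance to a new receiver set; removed receivers enter the ending phase. */
        def update(newReceivers: Set[String]): HeartbeatSenderState =
          copy(current = newReceivers,
               ending = ending ++ (current diff newReceivers).map(_ -> 0))

        /** Record one EndHeartbeat sent to an ending receiver. */
        def endHeartbeatSent(to: String): HeartbeatSenderState = {
          val count = ending.getOrElse(to, 0) + 1
          if (count >= nrOfEndHeartbeats) copy(ending = ending - to) // done ending
          else copy(ending = ending.updated(to, count))
        }
      }
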
* To avoid ordering surprises, metrics should be published via
  the same actor that handles the subscriptions and publishes the
  other cluster domain events (see the sketch below).
* Added missing publish in case of removal of member
(had a test failure for that)
* Added publishCurrentClusterState and sendCurrentClusterState
* Removed the Ping/Pong that was used for some tests; awaitCond is
  now needed anyway, since publishing to the eventStream is done afterwards
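* A minimal sketch of the single-publisher idea; the message and event types
  are placeholders, not the real akka-cluster ones:

      import akka.actor.{ Actor, ActorRef }

      object EventPublisher {
        final case class Subscribe(ref: ActorRef)
        final case class PublishEvent(event: AnyRef)     // cluster domain event
        final case class PublishMetrics(metrics: AnyRef) // routed through the same actor
      }

      // One actor owns the subscriber list and publishes both cluster domain
      // events and metrics, so subscribers observe them in a single,
      // consistent order.
      class EventPublisher extends Actor {
        import EventPublisher._
        private var subscribers = Set.empty[ActorRef]

        def receive = {
          case Subscribe(ref)          => subscribers += ref
          case PublishEvent(event)     => subscribers.foreach(_ ! event)
          case PublishMetrics(metrics) => subscribers.foreach(_ ! metrics)
        }
      }
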
* Major refactoring to remove the need for a special
  Cluster instance in tests. Use the default Cluster
  extension instead. Most of the changes are trivial.
* Used failure-detector.implementation-class from config
to swap to Puppet
* Removed FailureDetectorStrategy, since it doesn't add any value
* Added Cluster.joinSeedNodes to be able to test seed nodes when addresses
  are unknown before startup time (usage sketched below).
* Removed the ClusterEnvironment that was passed around among the actors;
  instead they use the ordinary Cluster extension.
* Overall much cleaner design
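* A usage sketch of programmatic joining via the Cluster extension; the system
  name, protocol and addresses are placeholders and depend on the remoting
  configuration:

      import akka.actor.{ ActorSystem, Address }
      import akka.cluster.Cluster

      object JoinSeedNodesExample extends App {
        val system = ActorSystem("ClusterSystem")
        // Seed node addresses resolved at runtime rather than in configuration
        val seedNodes = List(
          Address("akka.tcp", "ClusterSystem", "127.0.0.1", 2551),
          Address("akka.tcp", "ClusterSystem", "127.0.0.1", 2552))
        Cluster(system).joinSeedNodes(seedNodes)
      }
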
* Defined the domain events in the ClusterEvent.scala file
* Produce events from the diff and publish them to the event bus
  from a separate actor, ClusterDomainEventPublisher (see the sketch below)
* Adjustments of tests
* Gossip is not exposed in the user API
* Better and more events
* Snapshot event sent to new subscribers
* Updated tests
* Periodic publish only for internal stats
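* A minimal sketch of producing events from a diff; the types are simplified
  (String addresses, only two event kinds), whereas the real code diffs the
  full cluster state:

      final case class MemberUp(address: String)
      final case class MemberRemoved(address: String)

      object ClusterDomainEvents {
        // Compare old and new membership and produce the corresponding events,
        // which the publisher actor then publishes to the event bus.
        def diff(oldMembers: Set[String], newMembers: Set[String]): Seq[AnyRef] = {
          val ups      = (newMembers diff oldMembers).toSeq.sorted.map(MemberUp(_))
          val removals = (oldMembers diff newMembers).toSeq.sorted.map(MemberRemoved(_))
          ups ++ removals
        }
      }
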
* Implemented without ScatterGatherFirstCompletedRouter, since
that is more straightforward and might cause less confusion
* Added more description of what it does
* Can't reproduce with the actor-based Cluster; it was easy
  to reproduce before.
* If it is still a problem it will be detected by the runtime
  assertion. It will show up as:
  IllegalArgumentException: Nodes not part of cluster have marked the Gossip as seen
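* A minimal sketch of that kind of runtime assertion, with simplified types:

      // Every node that has marked the gossip as seen must be a cluster member,
      // otherwise construction fails with the IllegalArgumentException above.
      final case class GossipState(members: Set[String], seenBy: Set[String]) {
        require(seenBy.forall(members.contains),
          "Nodes not part of cluster have marked the Gossip as seen")
      }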