* Config of node roles via cluster.role
* Cluster router configurable with use-role
* RoleLeaderChanged event
* Cluster singleton per role
* Cluster only starts once all required per-role node
  counts are reached, configured with
  role.<role-name>.min-nr-of-members (see the config sketch below)
* Update documentation and make use of the roles in the examples
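
A rough sketch of how the role settings described above could be combined in
configuration; the exact key paths (cluster.role vs. akka.cluster.roles, the
deployment path used for the router) are assumptions for illustration, not
taken verbatim from the commits.

    import com.typesafe.config.ConfigFactory

    // Illustrative only; key paths are assumed, not verbatim from the commits.
    val rolesConfig = ConfigFactory.parseString("""
      akka.cluster.roles = ["backend"]
      # cluster only starts once at least 2 nodes with role "backend" have joined
      akka.cluster.role.backend.min-nr-of-members = 2
      # cluster-aware router that only uses nodes with role "backend"
      akka.actor.deployment."/service/workerRouter".cluster.use-role = "backend"
    """)
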
* The scenario was that the previous leader left.
* The problem was that the new leader got MemberRemoved
  before it got the HandOverDone and therefore missed the
  hand-over data.
* Solved by not switching the singleton to the new leader when
  receiving MemberRemoved; instead that is done on a normal
  HandOverDone, or in failure cases after the retry timeout
  (see the sketch below).
* The reason for this bug was the new transition from Down to
  Removed and that there is no longer a MemberDowned event.
  Previously this was only triggered by MemberDowned (not
  MemberRemoved), which was safe because it was "always"
  preceded by unreachable.
* The new solution means that it will take longer for the new
  singleton to start up when the previous leader is unreachable,
  but I don't want to trigger it on MemberUnreachable because it
  might become possible in the future to switch a node back to
  reachable.
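
A hypothetical, heavily simplified sketch of the fix described above. This is
not the real ClusterSingletonManager; the message names, retry scheme and the
startSingleton placeholder are made up. The point it illustrates is that the
new leader starts the singleton only on HandOverDone, or after the retries are
exhausted, never on MemberRemoved alone.

    import akka.actor.Actor
    import scala.concurrent.duration._

    // Hypothetical sketch, not the real ClusterSingletonManager.
    case object HandOverDone
    final case class TakeOverRetry(count: Int)

    class BecomingSingletonSketch(maxRetries: Int, retryInterval: FiniteDuration) extends Actor {
      import context.dispatcher
      context.system.scheduler.scheduleOnce(retryInterval, self, TakeOverRetry(1))

      def receive = {
        case HandOverDone =>
          startSingleton()                  // normal path: hand-over data received
        case TakeOverRetry(n) if n > maxRetries =>
          startSingleton()                  // failure path: previous leader is gone for good
        case TakeOverRetry(n) =>
          context.system.scheduler.scheduleOnce(retryInterval, self, TakeOverRetry(n + 1))
        // note: MemberRemoved of the previous leader is deliberately not a trigger
      }

      def startSingleton(): Unit = ()       // placeholder for creating the singleton
    }
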
* A problem may occur when a member joins again with the same
  hostname:port after being downed.
* Reproduced with StressSpec exerciseJoinRemove with a fixed port
  that joins and shuts down several times.
* The real solution will be covered by ticket #2788, adding a uid
  to the member identifier, but as a first step we need to support
  this scenario with the current design.
* Use a unique node identifier for the vector clock to avoid mixing
  up old and new member instances (see the sketch below).
* Support transition from Down to Joining in Gossip merge
* Don't gossip to unknown or unreachable members.
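
A minimal sketch of the unique-node-identifier idea, with a stand-in
VectorClock and a made-up VectorClockNode(address, uid) key; this is not
Akka's internal vector clock, only an illustration of why the key must
include more than hostname:port.

    // Stand-ins, not Akka's internal classes.
    final case class VectorClockNode(address: String, uid: Long) // uid is unique per incarnation

    final case class VectorClock(versions: Map[VectorClockNode, Long] = Map.empty) {
      def +(node: VectorClockNode): VectorClock =
        copy(versions = versions.updated(node, versions.getOrElse(node, 0L) + 1L))
    }

    // A node that is downed and rejoins with the same hostname:port gets a new uid,
    // so its counter is not mixed up with the previous incarnation's:
    //   val first  = VectorClockNode("akka://sys@host:2552", uid = 1L)
    //   val second = VectorClockNode("akka://sys@host:2552", uid = 2L) // after rejoin
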
* ClusterCoreDaemon and ClusterDomainEventPublisher can't be restarted
because the state would be obsolete.
* Add an extra supervisor level for ClusterCoreDaemon and
  ClusterDomainEventPublisher, which will shut down the member
  on failure in the children (see the sketch below).
* Publish the final removed state on postStop in
ClusterDomainEventPublisher. This also simplifies the removing
process.
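
An illustrative sketch of the extra supervisor level; it is not the actual
Akka source, and the child Props are injected here just to keep the example
self-contained. The shape is: stop instead of restart, and take the member
down on any failure.

    import akka.actor.{ Actor, ActorRef, OneForOneStrategy, Props, SupervisorStrategy }

    // Illustrative sketch, not the actual Akka source.
    class CoreSupervisorSketch(publisherProps: Props, daemonProps: Props) extends Actor {
      val publisher: ActorRef = context.actorOf(publisherProps, "publisher")
      val daemon: ActorRef = context.actorOf(daemonProps, "daemon")

      override val supervisorStrategy: SupervisorStrategy =
        OneForOneStrategy() {
          case _: Exception =>
            context.system.terminate() // shut down the member (its actor system)
            SupervisorStrategy.Stop    // never restart children with obsolete state
        }

      def receive = { case msg => daemon forward msg }
    }
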
* The previous work-around was introduced because Netty blocks when
  sending to broken connections. This is supposed to be solved by
  the new non-blocking remoting.
* Removed HeartbeatSender and CoreSender in cluster
* Added tests to verify that broken connections don't disturb live connections
* When a def starts with an if and is not a one-liner, the if
  should be on a new line.
* The reason is that it is easy to miss the if when reading
  the code.
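
A made-up example of the rule:

    // Discouraged (easy to miss the if):
    //   def sign(n: Int): String = if (n < 0) "negative"
    //     else "non-negative"
    // Preferred: the if starts on its own line.
    object IfStyleExample {
      def sign(n: Int): String =
        if (n < 0) "negative"
        else "non-negative"
    }
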
* Subscribe to InstantMemberEvent and start heartbeating when
InstantMemberUp. Same for metrics.
* HeartbeatNodeRing data structure for bidirectional mapping of
  heartbeat senders and receivers. Not using ConsistentHash anymore.
  Node addresses are hashed to ensure that ring neighbors are
  spread out (see the sketch below).
* HeartbeatRequest is sent when a receiver detects that it has not
  received the expected heartbeats.
* New test InitialHeartbeatSpec that simulates the problem
* Add/remove some related conf properties
* Add some more logging to be able to diagnose possible problems
* Explicit config of nr-of-end-heartbeats
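
A rough sketch of the HeartbeatNodeRing idea, with addresses reduced to
strings and a naive hash; it is not the actual implementation, only the
selection scheme: order the nodes by a hash of their address and heartbeat
to the next few nodes in the ring, and answer "who monitors me" from the
same structure.

    import scala.collection.immutable.SortedSet

    // Rough sketch, not the actual HeartbeatNodeRing implementation.
    final case class HeartbeatRingSketch(
        selfAddress: String,
        allAddresses: Set[String],
        monitoredByNrOfMembers: Int = 5) {

      // hash first, full address as tie-breaker, so neighbors are spread out
      private val ring: Vector[String] = {
        implicit val ord: Ordering[String] = Ordering.by(a => (a.hashCode, a))
        (SortedSet.empty[String] ++ allAddresses + selfAddress).toVector
      }

      private def nextAfter(address: String): Seq[String] = {
        val i = ring.indexOf(address)
        (1 to monitoredByNrOfMembers)
          .map(k => ring((i + k) % ring.size)) // walk the ring, wrapping around
          .filterNot(_ == address)
          .distinct
      }

      /** Nodes this node sends heartbeats to. */
      def myReceivers: Set[String] = nextAfter(selfAddress).toSet

      /** Nodes expected to send heartbeats to this node (bidirectional mapping). */
      def mySenders: Set[String] =
        ring.filter(a => a != selfAddress && nextAfter(a).contains(selfAddress)).toSet
    }
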
* The failure in JoinTwoClustersSpec was due to missing publishing
  of cluster events when clearing the current state on join
* This fix is a step in the right direction, but joining clusters
  like this will need some design thought; creating ticket 2873
  for that
* Renamed isRunning to isTerminated (with negation of course)
* Removed Running from JMX API, since the mbean is deregistered anyway
* Cleanup of isAvailable, isUnavailable
* Misc minor
* Previously heartbeat messages were sent to all other members, i.e.
  each member was monitored by all other members in the cluster.
* This was the number one known scalability bottleneck, due to the
  number of interconnections.
* Limit the sending of heartbeats to a few (5) members. Select and
  re-balance with a consistent hashing algorithm when members
  are added or removed.
* Send a few EndHeartbeat messages when ending the sending of
  Heartbeat messages (see the sketch below).
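
A rough sketch of the EndHeartbeat bookkeeping, with made-up names
(HeartbeatSendState, endHeartbeatTick); the counter corresponds to the
nr-of-end-heartbeats setting mentioned earlier. The idea: a removed receiver
still gets a small fixed number of EndHeartbeat messages so it can deregister
the sender from its failure detector even if one message is lost.

    // Made-up names; only the bookkeeping idea is taken from the notes above.
    final case class HeartbeatSendState(
        current: Set[String],        // receivers that get Heartbeat now
        ending: Map[String, Int]) {  // removed receivers -> EndHeartbeats left to send

      def addReceiver(node: String): HeartbeatSendState =
        copy(current = current + node, ending = ending - node)

      def removeReceiver(node: String, nrOfEndHeartbeats: Int): HeartbeatSendState =
        copy(current = current - node, ending = ending.updated(node, nrOfEndHeartbeats))

      /** On each heartbeat tick: who gets an EndHeartbeat, and the updated state. */
      def endHeartbeatTick: (Set[String], HeartbeatSendState) = {
        val stillEnding = ending.collect { case (node, left) if left > 1 => node -> (left - 1) }
        (ending.keySet, copy(ending = stillEnding))
      }
    }
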
* To avoid ordering surprises, metrics should be published via
  the same actor that handles the subscriptions and publishes
  the other cluster domain events (see the sketch below).
* Added missing publish in case of removal of member
(had a test failure for that)
* Added publishCurrentClusterState and sendCurrentClusterState
* Removed the Ping/Pong that was used for some tests; awaitCond is
  now needed anyway, since publishing to the eventStream is done
  afterwards
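
An illustrative sketch of that ordering argument, with simplified
Subscribe/Unsubscribe messages: a single publisher actor is the one place
where metrics and the other domain events are fanned out, so subscribers can
never observe them in a surprising relative order.

    import akka.actor.{ Actor, ActorRef }

    // Simplified stand-ins, not the real cluster event publisher.
    final case class Subscribe(subscriber: ActorRef)
    final case class Unsubscribe(subscriber: ActorRef)

    class DomainEventPublisherSketch extends Actor {
      private var subscribers = Set.empty[ActorRef]

      def receive = {
        case Subscribe(s)   => subscribers += s
        case Unsubscribe(s) => subscribers -= s
        case event          => subscribers.foreach(_ ! event) // single ordering point
      }
    }
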
* Major refactoring to remove the need for a special
  Cluster instance for testing. Use the default Cluster
  extension instead. Most of the changes are trivial.
* Used failure-detector.implementation-class from config
to swap to Puppet
* Removed FailureDetectorStrategy, since it doesn't add any value
* Added Cluster.joinSeedNodes to be able to test seed nodes when
  Addresses are unknown before startup (see the usage sketch below).
* Removed ClusterEnvironment that was passed around among the
  actors; instead they use the ordinary Cluster extension.
* Overall much cleaner design
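
A small usage sketch of joinSeedNodes; the system name, host names, ports and
protocol string are made up for the example.

    import akka.actor.{ ActorSystem, Address }
    import akka.cluster.Cluster

    val system = ActorSystem("ClusterSystem")
    // seed node addresses built at runtime, e.g. in a test
    val seedNodes = List(
      Address("akka", "ClusterSystem", "host1", 2552),
      Address("akka", "ClusterSystem", "host2", 2552))
    Cluster(system).joinSeedNodes(seedNodes)
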
* Defined the domain events in the ClusterEvent.scala file
* Produce events from a diff of the cluster state and publish them
  to the event bus from a separate actor, ClusterDomainEventPublisher
  (see the sketch below)
* Adjustments of tests
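
A heavily simplified sketch of "produce events from diff", with member state
reduced to an address string and made-up event names; the real publisher
diffs full membership state, but the shape is the same.

    // Made-up event names; only the diffing idea is taken from the notes above.
    sealed trait ClusterDomainEventSketch
    final case class MemberUpSketch(address: String) extends ClusterDomainEventSketch
    final case class MemberRemovedSketch(address: String) extends ClusterDomainEventSketch

    def diffEvents(oldMembers: Set[String], newMembers: Set[String]): Seq[ClusterDomainEventSketch] =
      (newMembers -- oldMembers).toSeq.map(MemberUpSketch) ++
        (oldMembers -- newMembers).toSeq.map(MemberRemovedSketch)

    // diffEvents(Set("a", "b"), Set("b", "c"))
    //   == Seq(MemberUpSketch("c"), MemberRemovedSketch("a"))
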
* Gossip is not exposed in the user API
* Better and more events
* Snapshot event sent to new subscriber
* Updated tests
* Periodic publish only for internal stats
* Implemented without ScatterGatherFirstCompletedRouter, since a
  direct implementation is more straightforward and might cause
  less confusion
* Added more description of what it does