51 commits

Author SHA1 Message Date
Veiga Ortiz, Héctor
c08bc317e2 +clu #13584 Accept joining to be WeaklyUp during network split
* experimental feature, disabled by default
* Adding documentation to mention weakly up members.
  plus adding new diagram.
2015-09-04 12:44:47 +02:00
Roland Kuhn
18688fc84b = #17380 fix doc comments for java8 doclint
* actor and cluster-metrics comments
* agent/camel/cluster/osgi/persistence/remote comments
* comments in contrib/persistence-tck/multi-node/typed
2015-05-18 12:51:36 +02:00
Patrik Nordwall
c991d5f1d1 =str #17200 Stop shard region when MemberRemoved
Two issues:

1) ShardRegion actor must stop itself when the node is shutting down,
   i.e. when receiving MemberRemoved(selfAddress)
2) ShardCoordinator must not persist anything when the node is shutting
   down. MemberRemoved of other shard regions will trigger Terminated,
   which must not be persisted, because then the next coordinator will
   replay those events and end up in the wrong state. This problem
   showed itself when using leaving, as illustrated in the new test.

To solve the second issue I have added a new ClusterShuttingDown event
that is published before the MemberRemoved events. Note that Terminated
is triggered by MemberRemoved.

(cherry picked from commit 1b272c72597beece9d93f0054f4b58e3d25f9ae2)
2015-04-22 12:46:30 +02:00
Patrik Nordwall
fe98dae650 =clu #13875 Fix regression in leader selection
* The leader is selected by picking the first reachable member, but in
  #13875 we had to let the self member be unreachable in the Reachability
  table and that was not considered in the logic of the leader selection.
* That is an unwanted change in behavior: especially when there is
  only one node left, the leader could be evaluated to None instead
  of Some(selfUniqueAddress).
* Note that #13875 has not been released yet.
2015-03-14 11:41:28 -07:00
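The selection rule described in the commit above can be sketched roughly as follows. This is a simplified illustration, not Akka's implementation: the function name `select_leader`, the use of plain strings for addresses, and lexicographic ordering are all assumptions made for the example, and member status is ignored entirely.

```python
from typing import Optional, Sequence, Set


def select_leader(members: Sequence[str],
                  unreachable: Set[str],
                  self_address: str) -> Optional[str]:
    """Pick the first reachable member in sorted order.

    The self member always counts as a candidate, even if the
    reachability data marks it as unreachable (hypothetical model
    of the fix described in the commit message).
    """
    candidates = [
        m for m in sorted(members)
        if m not in unreachable or m == self_address
    ]
    return candidates[0] if candidates else None


# A single remaining node that is marked unreachable still elects itself.
print(select_leader(["node-a"], {"node-a"}, "node-a"))
```

Without the `m == self_address` clause, the single-node case would yield `None`, which corresponds to the regression the commit message describes.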
Julian Tescher
00f6a58e7c Changes all occurrences of Typesafe copyright to extend to 2015 2015-03-10 14:12:19 -07:00
Patrik Nordwall
71ccb4c21b =clu #13875 Exclude unreachability observations from downed
* Skip observations from downed node (quarantined is marked down immediately)
  in convergence check
* Skip observations from downed node when picking "reachable" targets for gossip.
* This also means that we must accept gossip with own node marked as unreachable,
  but that should not be spread to the external membership events.
2015-02-06 10:19:48 +01:00
Andrei Pozolotin
7b9f77a073 + akka-cluster-metrics: new akka module
* new akka module split from akka-cluster
* provide sigar provisioning
* fix ewma usage
* resolve #16121
* see #16354
2015-01-19 10:23:54 -06:00
Patrik Nordwall
503c4ced8f !clu #3920 Remove deprecated Cluster.publishCurrentClusterState 2014-03-14 14:11:28 +01:00
dario.rexin
2cbad298d6 =all #3858 Make case classes final 2014-03-07 13:20:01 +01:00
Adam Voss
cce29dfa51 Changes all occurrences of Typesafe copyright to extend to 2014. 2014-02-04 21:20:09 -06:00
Patrik Nordwall
2e5193347e !clu #3617 API improvements related to CurrentClusterState
* Getter for CurrentClusterState in Cluster extension, updated via
  ClusterReadView
* Remove lazy init of readView. Otherwise the cluster.state will be
  empty on first access, which is probably surprising
* Subscribe to several cluster event types at once, to ensure *one*
  CurrentClusterEvent followed by change events
* Deprecate publishCurrentClusterState, was a bad idea, use sendCurrentClusterState
  instead
* Possibility to subscribe with InitialStateAsEvents to receive events corresponding
  to CurrentClusterState
* CurrentClusterState not a ClusterDomainEvent, ticket #3614
2014-01-16 16:17:44 +01:00
Patrik Nordwall
dc9fe4f19c !clu #2307 Allow transition from unreachable to reachable
* Replace unreachable Set with Reachability table
* Unreachable members stay in member Set
* Downing a live member previously moved it to the unreachable Set,
  and it was then removed from there by the leader. That will not
  work when flipping back to reachable, so a Down member must
  be detected as unreachable before being removed. Similar
  to Exiting. Member shuts itself down if it sees itself as
  Down.
* Flip back to reachable when failure detector monitors it as
  available again
* ReachableMember event
* Can't ignore gossip from aggregated unreachable (see SurviveNetworkInstabilitySpec)
* Make use of ReachableMember event in cluster router
* End heartbeat when acknowledged, EndHeartbeatAck
* Remove nr-of-end-heartbeats from conf
* Full reachability info in JMX cluster status
* Don't use interval after unreachable for AccrualFailureDetector history
* Add QuarantinedEvent to remoting, used for Reachability.Terminated
* Prune reachability table when all reachable
* Update documentation
* Performance testing and optimizations
2013-09-11 13:10:29 +02:00
Patrik Nordwall
a323936299 Disable cluster stats by default, see #3348
* Add VectorClockStats
2013-05-28 16:15:57 +02:00
Patrik Nordwall
ee6e80d31a Add previousStatus in MemberRemoved, see #3252 2013-05-23 11:09:32 +02:00
Patrik Nordwall
a0a0f39613 Hardening of cluster member leaving path, see #3309
* Removed leader commands for Shutdown and Exit
* Member shuts itself down when it sees itself as Exiting
* Singleton cluster with status Exiting will shut itself down,
  in case the Exiting gossip never arrives
* Exiting member is not part of the convergence check
* Exiting member is removed by leader (on convergence) when the
  exiting member is in the unreachable set, i.e. successfully shut down
* Reverted the change made for #3266, i.e. Exiting is
  detected as unreachable again.
* Adjust ClusterSingletonManager to new Exiting behaviour
* Fix bug in HeartbeatSender, which caused it to continue to
  send heartbeats to removed nodes, instead of rebalancing
* Refactoring of leaderActions method
* Leaving section in docs
2013-05-17 11:39:49 +02:00
Björn Antonsson
539df2e98a Enforce mailbox types on System actors. See #3273 2013-05-03 11:05:32 +02:00
Patrik Nordwall
4606612bd1 Reliable remote supervision and death watch, see #2993
* RemoteWatcher that monitors node failures, with heartbeats
  and failure detector
* Move RemoteDeploymentWatcher from CARP to RARP
* ClusterRemoteWatcher that handles cluster nodes
* Update documentation
* UID in Heartbeat msg to be able to quarantine,
  the actual quarantining will be implemented
  in ticket 2594
2013-04-17 19:42:51 +02:00
Patrik Nordwall
9e56ab6fe5 Disallow re-joining, see #2873
* Disallow join requests when already part of a cluster
* Remove wipe state when joining, since join can only be
  performed from empty state
* When trying to join, only accept gossip from that member
* Ignore gossips from unknown (and unreachable) members
* Make sure received gossip contains selfAddress
* Test join of fresh node with same host:port
* Remove JoinTwoClustersSpec
* Welcome message as reply to Join
* Retry unsuccessful join request
* AddressUidExtension
* Uid in cluster Member identifier
  To be able to distinguish nodes with same host:port
  after restart.
* Ignore gossip with wrong uid
* Renamed Remove command to Shutdown
* Use uid in vclock identifier
* Update sample, Member apply is private
* Disabled config duration syntax and cleanup of io settings
* Update documentation
2013-04-17 16:48:18 +02:00
Björn Antonsson
73f0f44ddb Protobuf serialization of cluster messages. See #1910 2013-04-11 10:09:05 +02:00
Patrik Nordwall
7eac88f372 Cluster node roles, see #3049
* Config of node roles cluster.role
* Cluster router configurable with use-role
* RoleLeaderChanged event
* Cluster singleton per role
* Cluster only starts once all required per-role node
  counts are reached,
  role.<role-name>.min-nr-of-members config
* Update documentation and make use of the roles in the examples
2013-03-18 11:56:11 +01:00
Patrik Nordwall
1e4b2585c7 Publish LeaderChanged when first seen, see #3131
* The problem in ClusterSingletonManagerChaosSpec was that node 4 doesn't publish
  LeaderChanged, because there is never convergence on node 4 of the new Up
  state for the three new nodes before they are shut down. When convergence
  is reached on node 4, prevConvergedGossip and newGossip have the same leader
  (i.e. no change).
* LeaderChanged is now published when the new leader is first seen, i.e. same
  as member events. This makes sense now that the leader can't be in Joining state.
2013-03-11 12:41:15 +01:00
Patrik Nordwall
5b844ec1e6 Publish member events when state change first seen, see #3075
* Remove InstantMemberEvent
2013-03-07 14:07:17 +01:00
Patrik Nordwall
5c7747e7fa Transition from Down to Removed, see #3075 2013-03-07 14:02:42 +01:00
Roland
bcfbea42c1 fix formatting of Java API in doc comments + genjavadoc 0.3 2013-03-07 09:05:55 +01:00
Patrik Nordwall
cab78e5174 Make cluster fault handling more robust, see #3030
* ClusterCoreDaemon and ClusterDomainEventPublisher can't be restarted
  because the state would be obsolete.
* Add extra supervisor level for ClusterCoreDaemon and
  ClusterDomainEventPublisher, which will shut down the member
  on failure in children.
* Publish the final removed state on postStop in
  ClusterDomainEventPublisher. This also simplifies the removing
  process.
2013-02-12 21:55:08 +01:00
Patrik Nordwall
d32a2edc51 Buffer LeaderChanged events and publish all on convergence, see #3017
* Otherwise some changes might never be published, since convergence does
  not have to be reached on all nodes in between all transitions.
* Detected by a failure in ClusterSingletonManagerSpec.
* Added a test to simulate the failure scenario.
2013-02-08 12:29:11 +01:00
Patrik Nordwall
79303a1785 Incorporate review comments, see #2803 2013-01-14 19:32:52 +01:00
Patrik Nordwall
d07f331e78 Publish InstantMemberEvent immediately, see #2803 2013-01-14 19:13:48 +01:00
Viktor Klang (√)
6b638db65e Merge pull request #1006 from akka/wip-2879-copyright2013-√
#2879 - updating copyright info
2013-01-14 04:59:29 -08:00
Viktor Klang
adfeb2c1f0 #2879 - updating copyright info 2013-01-09 11:38:00 +01:00
Patrik Nordwall
943c438d5e Publish clean state when joining (PublishStart), see #2871
* The failure in JoinTwoClustersSpec was due to missing publishing
  of cluster events when clearing current state when joining
* This fix is in the right direction, but joining clusters like this
  will need some design thought, creating ticket 2873 for that
2013-01-08 19:32:36 +01:00
Patrik Nordwall
f147f4d3d2 Stress / long running test of cluster, see #2786
* akka.cluster.StressSpec
* Configurable number of nodes and duration for each step
* Report metrics and phi periodically to see progress
* Configurable payload size
* Test of various join and remove scenarios
* Test of watch
* Exercise supervision
* Report cluster stats
* Test with many actors in tree structure

Apart from the test this commit also solves some issues:

* Avoid adding back members when downed in ClusterHeartbeatSender
* Avoid duplicate close of ClusterReadView
* Add back the publish of AddressTerminated when MemberDowned/Removed
  it was lost in merge of "publish on convergence", see #2779
2013-01-07 14:44:36 +01:00
Björn Antonsson
a03460329d Change cluster MemberEvents to only be published on convergence. See #2692
Conflicts:
	akka-cluster/src/main/scala/akka/cluster/ClusterEvent.scala
	akka-cluster/src/main/scala/akka/cluster/ClusterJmx.scala
	akka-cluster/src/main/scala/akka/cluster/ClusterMetricsCollector.scala
	akka-cluster/src/main/scala/akka/cluster/ClusterReadView.scala
	akka-cluster/src/multi-jvm/scala/akka/cluster/MultiNodeClusterSpec.scala
	akka-docs/rst/cluster/cluster-usage-java.rst
	akka-docs/rst/cluster/cluster-usage-scala.rst
	akka-kernel/src/main/dist/bin/akka-cluster
2012-12-14 12:46:13 +01:00
Patrik Nordwall
1cd3a05f41 Publish AddressTerminated after a member is Downed/Removed, see #2779
* Instead of when unreachable

* Note that ClusterRouterConfig is not changed, i.e. routees will be removed
  when unreachable
* Routers that are not wrapped by ClusterRouterConfig will watch as usual, i.e.
  remove routees when Terminated, i.e. node down
2012-12-12 12:55:22 +01:00
Patrik Nordwall
1914be7069 Merge branch 'master' into wip-2547-metrics-router-patriknw
Conflicts:
	akka-actor/src/main/scala/akka/actor/Deployer.scala
	akka-cluster/src/main/scala/akka/cluster/ClusterMetricsCollector.scala
	akka-cluster/src/test/scala/akka/cluster/MetricsCollectorSpec.scala
2012-11-15 12:33:11 +01:00
Patrik Nordwall
dcde7d3594 AdaptiveLoadBalancingRouter and more refactoring of metrics, see #2547
* Refactoring of standard metrics extractors and data structures
* Removed optional value in Metric, simplified a lot
* Configuration of EWMA by using half-life duration
* Renamed DataStream to EWMA
* Incorporate review feedback
* Use binarySearch for selecting weighted routees
* More metrics selectors for the router
* Removed network metrics, since not supported on linux
* Configuration of router
* Rename to AdaptiveLoadBalancingRouter
* Remove total cores metrics, since it's the same as jmx getAvailableProcessors,
  tested on intel 24 core server and amd 48 core server, and MBP
* API cleanup
* Java API additions
* Documentation of metrics and AdaptiveLoadBalancingRouter
* New cluster sample to illustrate metrics in the documentation,
  and play around with (factorial)
2012-11-14 15:08:30 +01:00
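As an illustration of "Configuration of EWMA by using half-life duration" in the commit above, the smoothing factor of an exponentially weighted moving average can be derived from a half-life as sketched below. This is the standard exponential-decay relation, not code taken from the commit; the function names and the use of seconds as the unit are assumptions for the example.

```python
import math


def ewma_alpha(half_life_s: float, collect_interval_s: float) -> float:
    """Smoothing factor such that a sample's weight halves every half_life_s."""
    decay_rate = math.log(2) / half_life_s
    return 1.0 - math.exp(-decay_rate * collect_interval_s)


def ewma_update(previous: float, sample: float, alpha: float) -> float:
    """Blend a new sample into the running average."""
    return alpha * sample + (1.0 - alpha) * previous


# With a collect interval equal to the half-life, alpha is one half
# (up to floating-point rounding).
alpha = ewma_alpha(half_life_s=10.0, collect_interval_s=10.0)
print(alpha, ewma_update(previous=0.0, sample=10.0, alpha=0.5))
```

Expressing the decay as a half-life keeps the configuration independent of how often samples are collected: changing the collect interval changes alpha, but not how quickly old data fades in wall-clock time.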
Patrik Nordwall
c959d4a973 Incorporate feedback, see #2502 2012-10-05 08:17:54 +02:00
Patrik Nordwall
acdafa0cd3 Additions for Java API of cluster, see #2502 2012-10-04 14:16:11 +02:00
Patrik Nordwall
49b9ec6c2c Publish cluster metrics through the publisher actor.
* To avoid ordering surprises metrics should be published via
  the same actor that handles the subscriptions and publishes
  other cluster domain events.
* Added missing publish in case of removal of member
  (had a test failure for that)
2012-10-02 17:08:38 +02:00
Patrik Nordwall
51ff9ce6d1 Cluster.unsubscribe with class parameter, see #2567 2012-09-28 13:09:36 +02:00
Helena Edelson
dbce1c8b85 Cluster metrics internal API and cluster-wide transport of metrics data.
* Create Cluster Metrics API
* Create transport of relevant metrics data
Does not include load-balancing routers.
2012-09-24 13:07:11 -06:00
Patrik Nordwall
9423d37da9 Merge branch 'master' into wip-cluster-docs-patriknw
Conflicts:
	project/AkkaBuild.scala
2012-09-20 10:40:08 +02:00
Patrik Nordwall
ab8a690c65 Use Either for LeaderChanged state, see #2518 2012-09-20 08:44:44 +02:00
Patrik Nordwall
718686e2f2 Add another test case for publish of LeaderChanged, see #2518
* It didn't handle convergence changes with same leader correctly
2012-09-19 10:18:55 +02:00
Patrik Nordwall
c0c6cc3931 Publish cluster LeaderChanged only on convergence, see #2518 2012-09-18 14:19:38 +02:00
Patrik Nordwall
50d0efe7d4 Request send/publish of CurrentClusterState, see #2438
* Added publishCurrentClusterState and sendCurrentClusterState
* Removed Ping/Pong that was used for some tests, since awaitCond is
  now needed anyway, since publish to eventStream is done afterwards
2012-09-12 09:23:02 +02:00
Patrik Nordwall
911ef6b97e Merge pull request #668 from akka/wip-1588-cluster-death-watch-patriknw
Death watch hooked up with cluster failure detector, see #1588
2012-09-11 06:13:44 -07:00
Patrik Nordwall
bd6c39178c Fix leaking this in constructor of Cluster, see #2473
* Major refactoring to remove the need to use special
  Cluster instance for testing. Use default Cluster
  extension instead. Most of it is trivial changes.
* Used failure-detector.implementation-class from config
  to swap to Puppet
* Removed FailureDetectorStrategy, since it doesn't add any value
* Added Cluster.joinSeedNodes to be able to test seedNodes when Addresses
  are unknown before startup time.
* Removed ClusterEnvironment that was passed around among the actors,
  instead they use the ordinary Cluster extension.
* Overall much cleaner design
2012-09-06 21:48:40 +02:00
Patrik Nordwall
6b40ddc755 Maintain AddressTerminated subscription in DeathWatch, see #1588 2012-09-03 20:37:33 +02:00
Patrik Nordwall
b1e251e0bc Prototype of death watch hooked up with failure detector, see #1588
* Probably a lot of things missing, but wanted to try the first idea
* The test is green :)
2012-08-31 16:37:35 +02:00