121 commits

Author SHA1 Message Date
Patrik Nordwall
23fa8b0810 change spelling of behaviour to behavior, #24457 2018-02-01 15:10:46 +01:00
Christopher Batey
5a37cdc862 Cross DC gossip fixes #23803
* Adjust cross DC gossip probability for a small number of nodes in a DC
When a DC is being bootstrapped the initial node has no local peers and
cannot gossip if it selects a local gossip round. Start at a
probability of 1.0 for a single node cluster and move down 0.25 per node
until a 5 node DC is reached, then use the cross-data-center-gossip-probability
* Fix cross DC gossip selection of oldest members
This used to select the members based on the sort order of members in
Gossip (by address) rather than by upNumber
2017-11-02 09:17:24 +01:00
Patrik Nordwall
4f8856f108 Merge pull request #23551 from akka/wip-23502-join-timeout-patriknw
Add timeout to abort joining of seed nodes, #23502
2017-09-11 16:41:35 +02:00
Patrik Nordwall
5cf698a2f6 Add timeout to abort joining of seed nodes, #23502 2017-09-11 15:56:25 +02:00
Patrik Nordwall
1e4e7cbba2 Merge pull request #23583 from akka/wip-multi-dc-merge-master-patriknw
merge wip-multi-dc-dev back to master
2017-09-01 17:08:28 +02:00
Patrik Nordwall
6ed3295acd Merge branch 'master' into wip-multi-dc-merge-master-patriknw 2017-08-31 10:51:12 +02:00
Patrik Nordwall
6753c1e624 Don't use WeaklyUp immediately, #23554
* see description in issue
2017-08-22 12:02:04 +02:00
Sébastien Lorion
a95a94acff Replace ClusterRouterGroup/Pool "use-role" with "use-role-set" #23496 2017-08-09 16:06:18 +02:00
Johan Andrén
9c7e8d027a Renamed/moved the self data center setting #23312 (#23344) 2017-07-12 11:47:32 +01:00
Johan Andrén
c0d439eac3 limit cross dc gossip #23282 2017-07-07 13:19:10 +01:00
Konrad `ktoso` Malawski
b568975acc =clu #23229 multi-dc heartbeating, only N nodes perform monitoring 2017-07-07 12:17:41 +01:00
Patrik Nordwall
bb9549263e Rename team to data center, #23275 2017-07-04 17:11:21 +02:00
Johan Andrén
164387a89e [WIP] one leader per cluster team (#23239)
* Guarantee no sneaky type puts more teams in the role list

* Leader per team and initial tests

* MiMa filters

* Second iteration (not working though)

* Verbose gossip logging etc.

* Gossip to team-nodes even if there is inter-team unreachability

* More work ...

* Marking removed nodes with tombstones in Gossip

* More test coverage for Gossip.remove

* Bug failing other multi-node tests squashed

* Multi-node test for team-split

* Review fixes - only prune tombstones on leader ticks

* Clean code is happy code.

* All I want is for MiMa to be my friend

* These constants are internal

* Making the formatting gods happy

* I used the wrong reachability for ignoring gossip :/

* Still hadn't quite gotten how reachability was supposed to work

* Review feedback applied

* Cross-team downing should still work

* Actually prune tombstones in the prune tombstones method ...

* Another round against reachability. Reachability leading with 15 - 2 so far.
2017-07-04 10:09:40 +02:00
Arnout Engelen
0115d5fdda Less abbreviations, more reliable test
(cherry picked from commit 61e289b276f410654c1b063c33648e0d7ea88e50)
2017-07-03 10:47:21 +02:00
Arnout Engelen
2f11ec6f25 Introduce cluster 'team' setting and add to Member
Introduced cluster-team.md so we can grow the documentation with each
PR, but did not add it to the ToC yet.

(cherry picked from commit a06badaa03fa9f3c9a942b1468090f758c74a869)
2017-07-03 10:47:14 +02:00
Patrik Nordwall
a7dc938188 Revert "Introduce cluster 'team' setting and add to Member"
This reverts commit a06badaa03fa9f3c9a942b1468090f758c74a869.
2017-07-03 10:44:36 +02:00
Patrik Nordwall
bd6afb8952 Revert "Less abbreviations, more reliable test"
This reverts commit 61e289b276f410654c1b063c33648e0d7ea88e50.
2017-07-03 10:44:24 +02:00
Arnout Engelen
9f78cd12c4 Introduce cluster 'team' setting and add to Member (#23234)
* Introduce cluster 'team' setting and add to Member

Introduced cluster-team.md so we can grow the documentation with each
PR, but did not add it to the ToC yet.

* Less abbreviations, more reliable test
2017-06-26 16:28:06 +02:00
Johan Andrén
3643f18ded Protobuf serializers for remote deployment #22332 2017-03-16 15:12:35 +01:00
Richard Imaoka
cc1312922c Allow multiple Cluster JMX MBeans in the same JVM (#22484)
* Allow multiple Cluster JMX MBeans in the same JVM (#18772)

* Remove unnecessary whitespace
2017-03-14 14:31:58 +01:00
Patrik Nordwall
1700cdaebc Promote WeaklyUp and enable by default, #22197 2017-01-24 12:31:32 +01:00
Patrik Nordwall
452b3f1406 remove old deprecated cluster metrics, #21423
* corresponding functionality was moved to akka-cluster-metrics, see
  http://doc.akka.io/docs/akka/2.4/project/migration-guide-2.3.x-2.4.x.html#New_Cluster_Metrics_Extension
2017-01-20 13:48:36 +01:00
Patrik Nordwall
84ade6fdc3 add CoordinatedShutdown, #21537
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase config obj with depends-on list
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
* send ExitingConfirmed to leader before shutdown of Exiting
  to not have to wait for failure detector to mark it as
  unreachable before removing
* the unreachable signal is still kept as a safeguard if the
  message is lost or the leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate ActorSystem when cluster shutdown (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
* problem when the leader order was sys2, sys1, sys3,
  then sys3 could not perform its duties and move Leaving sys1 to
  Exiting because it was observing sys1 as unreachable
* exclude Leaving with exitingConfirmed from convergence condition
2017-01-16 09:01:57 +01:00
Johan Andrén
0f376e751e Quarantine gracefully downed node after some time (#21534)
* New setting for quarantining after graceful leave
2016-09-28 14:04:58 +02:00
Patrik Nordwall
0a75f992e4 Update links to Lightbend RPv2, more warnings about auto-down 2016-09-02 10:26:47 +02:00
Johan Andrén
5671927cf1 clu #20309 API for pluggable cluster downing 2016-04-18 15:06:05 +02:00
Konrad Malawski
35108384c9 =clt #19381 silence heartbeat logging in cluster client 2016-01-08 12:11:56 +01:00
Martynas Mickevičius
fb664c54a5 =doc fix URL to "The ϕ Accrual Failure Detector" paper 2015-11-04 16:26:45 +02:00
Patrik Nordwall
22b8853314 =clu #13584 mark as experimental and some doc clarifications 2015-09-04 14:09:41 +02:00
Veiga Ortiz, Héctor
c08bc317e2 +clu #13584 Accept joining to be WeaklyUp during network split
* experimental feature, disabled by default
* Adding documentation to mention weakly up members.
  plus adding new diagram.
2015-09-04 12:44:47 +02:00
Patrik Nordwall
bfde1eff19 =clu #18337 Disable down-removal-margin by default
For manual downing it is not needed. For auto-down it doesn't add any extra safety, since that
is not handling network partitions anyway.

The setting is still useful if you implement downing strategies that handle network partitions,
e.g. by keeping the larger side of the partition and shutting down the smaller side.
2015-09-04 11:28:33 +02:00
Patrik Nordwall
bc13e1b4c2 =clu #13802 Introduce max-total-nr-of-instances for cluster aware routers 2015-08-21 14:51:59 +02:00
Patrik Nordwall
f72b1bea9f =rem,clu #17750 Decrease default expected-response-after 2015-08-19 07:34:24 +02:00
Patrik Nordwall
2a88f4fb29 =clu Improve cluster downing
* avoid Down and Exiting members being used for joining
* delay shut down of Down member until the information is spread
  to all reachable members, e.g. downing several nodes via one node
* akka.cluster.down-removal-margin setting
  Margin until shards or singletons that belonged to a
  downed/removed partition are created in surviving partition.
  Used by singleton and sharding.
* remove the retry count parameters/settings for singleton in
  favor of deriving those from the removal-margin
2015-06-18 12:55:54 +02:00
Patrik Nordwall
96c84a1df6 =rem #17567 Adjust parameters for DeadlineFailureDetector
To be more aligned with PhiAccrualFailureDetector the
DeadlineFailureDetector should trigger after
heartbeat-interval + acceptable-heartbeat-pause
2015-05-29 10:20:42 +02:00
Andrei Pozolotin
6332f888ce +all #16632 Make serialization identifiers configurable in reference.conf 2015-03-05 11:55:05 -06:00
Patrik Nordwall
1e445b4eba !act,rem,clu #3920 Remove deprecated old routers 2014-03-14 14:12:11 +01:00
Patrik Nordwall
b5be06e90c !clu #3920 Remove deprecated akka.cluster.auto-down
* replaced by akka.cluster.auto-down-unreachable-after
2014-03-14 14:11:28 +01:00
Patrik Nordwall
4b843476ef =clu,rem #3632 Correct wrong transport in docs 2014-01-21 15:14:27 +01:00
Patrik Nordwall
eaad7ecf7e !clu #3683 Change cluster heartbeat to req/rsp protocol
* The previous one-way heartbeat was elegant, but complicated to
  understand and without giving much extra value compared to this approach.
* The previous one-way heartbeat has some kind of bug when joining
  several (10-20) nodes at approximately the same time (but not exactly
  the same time) with a false failure detection triggered by the extra heartbeat,
  which would not heal.
* This ping-pong approach will increase network traffic slightly, but heartbeat
  messages are small and each node is limited to monitor (default) 5 peers.
2013-11-15 08:18:52 +01:00
Patrik Nordwall
3bdac872ff =clu #3683 Don't trigger extra heartbeat when not expected sender 2013-10-22 14:48:37 +02:00
Patrik Nordwall
ff83edea0b Merge pull request #1785 from akka/wip-3458-adjust-biased-gossip-patriknw
+clu #3458 Adjust biased gossip for large cluster
2013-10-18 07:58:50 -07:00
Patrik Nordwall
532c98c6cd +clu #3458 Adjust biased gossip for large cluster 2013-10-18 14:34:36 +02:00
Patrik Nordwall
7d5a3ec30b !clu #3657 Lazy deserialization and TTL of Gossip message payload 2013-10-18 08:29:46 +02:00
Patrik Nordwall
402674ce10 +clu #3627 Cluster router group with multiple paths per node
* Use the ordinary routees.paths config property instead of
  cluster.routees-path
* Backwards compatible in deprecation phase
2013-10-16 11:44:00 +02:00
Patrik Nordwall
ebadd567b2 !act,rem,clu #3549 Simplify and enhance routers
* Separate routing logic, to be usable stand alone, e.g. in actors
* Simplify RouterConfig, only a factory
* Move reading of config from Deployer to the RouterConfig
* Distinction between Pool and Group router types
* Remove usage of actorFor, use ActorSelection
* Management messages to add and remove routees
* Simplify the internals of RoutedActorCell & co
* Move resize specific code to separate RoutedActorCell subclass
* Change resizer api to only return capacity change
* Resizer only allowed together with Pool
* Re-implement all routers, and keep old api during deprecation phase
* Replace ClusterRouterConfig, deprecation
* Rewrite documentation
* Migration guide
* Also includes related ticket:
  +act #3087 Create nicer Props factories for RouterConfig
2013-10-16 09:27:13 +02:00
Patrik Nordwall
d5b25cbbc6 !act #3583 Timer based auto-down
* Replace (deprecate) akka.cluster.auto-down config setting with
  akka.cluster.auto-down-unreachable-after
* AutoDown actor that keeps track of unreachable members
  and performs down from the leader node when they have been
  unreachable for the specified duration
* Migration guide
2013-09-27 14:32:03 +02:00
Patrik Nordwall
dc9fe4f19c !clu #2307 Allow transition from unreachable to reachable
* Replace unreachable Set with Reachability table
* Unreachable members stay in member Set
* Downing a live member previously moved it to the unreachable Set,
  and it was then removed from there by the leader. That will not
  work when flipping back to reachable, so a Down member must
  be detected as unreachable before being removed. Similar
  to Exiting. Member shuts itself down if it sees itself as
  Down.
* Flip back to reachable when failure detector monitors it as
  available again
* ReachableMember event
* Can't ignore gossip from aggregated unreachable (see SurviveNetworkInstabilitySpec)
* Make use of ReachableMember event in cluster router
* End heartbeat when acknowledged, EndHeartbeatAck
* Remove nr-of-end-heartbeats from conf
* Full reachability info in JMX cluster status
* Don't use interval after unreachable for AccrualFailureDetector history
* Add QuarantinedEvent to remoting, used for Reachability.Terminated
* Prune reachability table when all reachable
* Update documentation
* Performance testing and optimizations
2013-09-11 13:10:29 +02:00
Patrik Nordwall
8c2859ad03 Make akka.cluster.MetricsCollector public, see #3452 2013-06-18 15:07:26 +02:00
Patrik Nordwall
95366cb585 Wrap long lines, for pdf 2013-05-30 14:45:15 +02:00