Commit graph

164 commits

Author SHA1 Message Date
Christopher Batey
5a37cdc862 Cross DC gossip fixes #23803
* Adjust cross DC gossip probability for small number of nodes in a DC
When a DC is being bootstrapped the initial node has no local peers and
cannot gossip if it selects a local gossip round. Start at a
probability of 1.0 for a single node cluster and move down 0.25 per node
until a 5 node DC is reached, then use the cross-data-center-gossip-probability
* Fix cross DC gossip selection of oldest members
This used to select the members based on the sort order of members in
Gossip (by address) rather than by upNumber
2017-11-02 09:17:24 +01:00
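The probability ramp described in commit 5a37cdc862 can be sketched as follows. This is a minimal illustration that follows the commit message literally, not the code from the commit itself: the function name and parameter names are hypothetical, and whether the real implementation also clamps the ramp against the configured value is not stated in the message.

```python
def cross_dc_gossip_probability(local_dc_size: int, configured_probability: float) -> float:
    """Probability of picking a cross-DC gossip round for a DC of the given size.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if local_dc_size < 1:
        raise ValueError("a data center has at least one node (the local one)")
    if local_dc_size >= 5:
        # From 5 nodes upward, use the configured
        # cross-data-center-gossip-probability value.
        return configured_probability
    # 1 node -> 1.0, 2 nodes -> 0.75, 3 nodes -> 0.5, 4 nodes -> 0.25
    return 1.0 - 0.25 * (local_dc_size - 1)
```

With a single node the function returns 1.0, so the bootstrapping node always gossips across DCs, which matches the motivation given in the message: it has no local peers to gossip with.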
Patrik Nordwall
86712d5b40 fix confusing logging when receiving gossip from unknown 2017-10-31 14:05:51 +01:00
Patrik Nordwall
5fc6d5a04a Verify removal and add of new node incarnation in multi-dc, #23585
* MemberRemoved must be published before MemberUp, e.g. when restarted
  in other DC
* remove from failureDetector when receiving gossip with new member,
  not only new joining member

* increase timeout in MultiDcSingletonManagerSpec
2017-09-25 16:47:06 +02:00
Patrik Nordwall
4f8856f108 Merge pull request #23551 from akka/wip-23502-join-timeout-patriknw
Add timeout to abort joining of seed nodes, #23502
2017-09-11 16:41:35 +02:00
Patrik Nordwall
5cf698a2f6 Add timeout to abort joining of seed nodes, #23502 2017-09-11 15:56:25 +02:00
Patrik Nordwall
cb08535e7d use right youngest when moving to Up, #23582
* also confirm TakeOverFromMe when singleton already in oldest state
2017-09-04 16:02:23 +02:00
Patrik Nordwall
1e4e7cbba2 Merge pull request #23583 from akka/wip-multi-dc-merge-master-patriknw
merge wip-multi-dc-dev back to master
2017-09-01 17:08:28 +02:00
Patrik Nordwall
e3aada5016 Connect the dots for cross-dc reachability, #23377
* the crossDcFailureDetector was not connected to the reachability table
* additional test that listens for {Reachable/Unreachable}DataCenter events in split spec
* missing Java API for getUnreachableDataCenters in CurrentClusterState
2017-08-22 15:05:40 +02:00
Patrik Nordwall
6753c1e624 Don't use WeaklyUp immediately, #23554
* see description in issue
2017-08-22 12:02:04 +02:00
Johan Andrén
9c7e8d027a Renamed/moved the self data center setting #23312 (#23344) 2017-07-12 11:47:32 +01:00
Johan Andrén
a15e459922 Merging did not prune vector clocks for tombstoned nodes #23318 2017-07-10 13:01:06 +01:00
Johan Andrén
c0d439eac3 limit cross dc gossip #23282 2017-07-07 13:19:10 +01:00
Konrad `ktoso` Malawski
b568975acc =clu #23229 multi-dc heartbeating, only N nodes perform monitoring 2017-07-07 12:17:41 +01:00
Patrik Nordwall
867cc97bdd Refactoring of Gossip class, #23290
* move methods that depend on selfUniqueAddress and selfDc
  to a separate MembershipState class, which also holds the
  latest gossip
* this removes the need to pass in the parameters from everywhere and
  makes it easier to cache some results
* makes it clear that those parameters are always selfUniqueAddress
  and selfDc, instead of some arbitrary node/dc
2017-07-05 08:47:32 +02:00
Patrik Nordwall
bb9549263e Rename team to data center, #23275 2017-07-04 17:11:21 +02:00
Johan Andrén
164387a89e [WIP] one leader per cluster team (#23239)
* Guarantee no sneaky type puts more teams in the role list

* Leader per team and initial tests

* MiMa filters

* Second iteration (not working though)

* Verbose gossip logging etc.

* Gossip to team-nodes even if there is inter-team unreachability

* More work ...

* Marking removed nodes with tombstones in Gossip

* More test coverage for Gossip.remove

* Bug failing other multi-node tests squashed

* Multi-node test for team-split

* Review fixes - only prune tombstones on leader ticks

* Clean code is happy code.

* All I want is for MiMa to be my friend

* These constants are internal

* Making the formatting gods happy

* I used the wrong reachability for ignoring gossip :/

* Still hadn't quite gotten how reachability was supposed to work

* Review feedback applied

* Cross-team downing should still work

* Actually prune tombstones in the prune tombstones method ...

* Another round against reachability. Reachability leading with 15 - 2 so far.
2017-07-04 10:09:40 +02:00
Nafer Sanabria
ef76af7add =cls add logging info on seed node joining (#22724)
* =cls add logging info on seed node joining

* adjust message
2017-05-19 14:20:29 +02:00
Patrik Nordwall
41c756f169 properly shutdown ArteryTransport using CoordinatedShutdown, #22671 (#22698)
* properly shutdown ArteryTransport using CoordinatedShutdown, #22671

* The shutdownHook changed hasBeenShutdown flag to true, and then when
  the transport.shutdown was invoked the shutdown sequence was ignored
  until it was too late, ActorSystem already terminated.
* Also improved the cluster shutdown tasks when the cluster node had not
  joined

* CoordinatedShutdownLeave explicit events
2017-04-11 21:48:51 +02:00
Devis Lucato
b89008bdaf Fix "attmpts" typo 2017-03-01 12:44:32 +01:00
Patrik Nordwall
452b3f1406 remove old deprecated cluster metrics, #21423
* the corresponding functionality was moved to akka-cluster-metrics, see
  http://doc.akka.io/docs/akka/2.4/project/migration-guide-2.3.x-2.4.x.html#New_Cluster_Metrics_Extension
2017-01-20 13:48:36 +01:00
Patrik Nordwall
84ade6fdc3 add CoordinatedShutdown, #21537
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase config obj with depends-on list
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
* send ExitingConfirmed to leader before shutdown of Exiting
  to not have to wait for failure detector to mark it as
  unreachable before removing
* the unreachable signal is still kept as a safeguard if
  message is lost or leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate ActorSystem when cluster shutdown (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
* problem when the leader order was sys2, sys1, sys3,
  then sys3 could not perform its duties and move Leaving sys1 to
  Exiting because it was observing sys1 as unreachable
* exclude Leaving with exitingConfirmed from convergence condition
2017-01-16 09:01:57 +01:00
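The "tasks for configured phases in order (DAG)" idea from commit 84ade6fdc3 can be illustrated with a topological sort over a depends-on map. This is a hedged sketch only: the phase names, the `phase_order` and `run_phases` helpers, and the data layout are all hypothetical and are not taken from the commit or from the CoordinatedShutdown API. Timeouts and recovery, which the message also mentions, are left out.

```python
from graphlib import TopologicalSorter

# Hypothetical phase configuration: each phase lists the phases it depends on.
# These names are made up for illustration and are not the real phase names.
phases = {
    "stop-accepting-work": [],
    "leave-cluster": ["stop-accepting-work"],
    "flush-metrics": ["stop-accepting-work"],
    "shutdown-cluster": ["leave-cluster"],
    "terminate-system": ["shutdown-cluster", "flush-metrics"],
}


def phase_order(depends_on):
    """Return the phases so that every phase comes after all of its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def run_phases(depends_on, tasks):
    """Run the tasks registered for each phase, phase by phase, in dependency order."""
    executed = []
    for phase in phase_order(depends_on):
        for task in tasks.get(phase, []):
            task()
        executed.append(phase)
    return executed
```

Expressing the configuration as a depends-on list per phase means custom phases can be slotted in between predefined ones without renumbering anything; a cycle in the configuration is rejected by `TopologicalSorter` with a `CycleError`.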
Patrik Nordwall
180361868c Merge pull request #22054 from akka/wip-22053-log-join-retry-patriknw
log join retries, #22053
2017-01-09 14:18:40 +01:00
Philippus Baalman
6c7085252a extended copyright into 2017 2017-01-04 17:37:15 +01:00
Patrik Nordwall
645ae4cb31 log join retries, #22053 2016-12-21 16:15:56 +01:00
Patrik Nordwall
68383b5001 harden cluster leaving, #21847
As documented in the code:

// Leader is moving itself from Leaving to Exiting. Let others know (best effort)
// before shutdown. Otherwise they will not see the Exiting state change
// and there will not be convergence until they have detected this node as
// unreachable and the required downing has finished. They will still need to detect
// unreachable, but Exiting unreachable will be removed without downing, i.e.
// normally the leaving of a leader will be graceful without the need
// for downing. However, if those final gossip messages never arrive it is
// alright to require the downing, because that is probably caused by a
// network failure anyway.

That is fine, but this change improves the selection of the nodes to
send the final gossip messages to.

I could reproduce the failure in ClusterSingletonManagerLeaveSpec and with
additional logging I verified that in the failure case it picked the "first"
node 3 times (it's random) and that node had already been shut down (left earlier
in the test) but was not removed yet.
2016-11-18 12:33:42 +01:00
Johan Andrén
8ae0c9a888 Use long uid in artery remoting and cluster #20644 2016-09-26 15:34:59 +02:00
Endre Sándor Varga
5e830323f6 Updating to ScalaTest 3.0.0 and ScalaCheck 1.13.2 2016-08-22 11:13:49 +02:00
Patrik Nordwall
0c4d4c37ba cluster singleton improvements, #20942
* track nodes by UniqueAddress in Cluster Singleton, #20942
* reply with HandOverDone from new incarnation, #20942
* confirm as terminated immediately when new incarnation joins, #20942
  instead of waiting for failure detector to mark it as unreachable;
  this will speed up removal when restarting cluster node with same hostname:port
2016-08-19 11:56:55 +02:00
Patrik Nordwall
d731f20bf1 suppress deadletter for the cluster joining messages 2016-08-09 17:22:31 +02:00
Björn Antonsson
c66ce62d63 Update to a working version of Scalariform 2016-06-02 22:12:36 +02:00
Yegor Andreenko
c66e3a9f02 =clu #20613 logging selfRoles during node unreachable and quarantined (#20542) 2016-05-24 14:35:50 +02:00
Johan Andrén
5671927cf1 clu #20309 API for pluggable cluster downing 2016-04-18 15:06:05 +02:00
adebski
472d404bbe =clu #19859 Relaxed constraints on downing old incarnation of rejoining node.
* Automatic downing of the old node incarnation when a new one tries to rejoin the cluster is performed even if the old incarnation was left in Leaving or Exiting state.
* Added information to clustering docs about automatic downing of old incarnations when a new one tries to rejoin the cluster.
2016-02-26 20:35:19 +01:00
Johannes Rudolph
b6cbc7f13a =all remove unused imports 2016-02-23 20:29:22 +01:00
Johan Andrén
62e30b3c08 Update copyrights and links to the new company name #19851 2016-02-23 12:58:39 +01:00
Prayag Verma
b7783968a0 =pro #19068 All copyrights ranges and single years updated to a range ending in 2016 2016-01-25 10:20:30 +01:00
Roland Kuhn
f1abaa1c5e Merge pull request #18875 from ktoso/wip-akka.js-cherries-ktoso
Akka.js cherries to master
2015-11-07 18:01:24 +01:00
Patrik Nordwall
c7c187f6b7 =clu replace Set -- with diff and ++ with union
* better performance according to
  https://docs.google.com/presentation/d/1Qjryxoe-fYEM8ZPhM-98LKfbhnRcn5eAEMNlVVnixsA/pub
2015-11-06 14:48:17 +01:00
Andrea
cd3d68a77c =act switch to java std lib ThreadLocalRandom 2015-11-06 14:04:33 +01:00
Patrik Nordwall
9380983d3c =clu #18554 Make oldest assignment deterministic when joining
* the reported issue is fixed by the immediate leaderActions
  (moving to Up) when joining the first node to itself
* the other changes are precautions just in case
2015-10-21 07:53:14 +02:00
Veiga Ortiz, Héctor
c08bc317e2 +clu #13584 Accept joining to be WeaklyUp during network split
* experimental feature, disabled by default
* Adding documentation to mention weakly up members.
  plus adding new diagram.
2015-09-04 12:44:47 +02:00
Patrik Nordwall
737a50ebf3 =clu #17253 Improve cluster startup thread usage
When using a dispatcher (default or separate cluster dispatcher)
with less than 5 threads the Cluster extension initialization
could deadlock.

It was reproducible by adding a sleep before the Await of GetClusterCoreRef
in the Cluster extension constructor. The reason was that other cluster actors were
started too early and they also tried to get the Cluster extension, thereby blocking
dispatcher threads.

Note that the Cluster extension is started via ClusterActorRefProvider before
ActorSystem.apply returns.

The improvement is to start the cluster child actors lazily when the
GetClusterCoreRef is received.
2015-09-03 18:09:31 +02:00
Patrik Nordwall
5cf35938d0 =clu #13226 Prune vector clocks from removed member 2015-08-11 15:40:42 +02:00
Roland Kuhn
0de9f0ff40 Merge pull request #17641 from kukido/kukido-spellings-normalization
=doc #17329 Fixed and normalized spellings in ScalaDoc and comments
2015-06-19 12:06:53 +02:00
Patrik Nordwall
2a88f4fb29 =clu Improve cluster downing
* avoid using Down and Exiting members for joining
* delay shut down of Down member until the information is spread
  to all reachable members, e.g. downing several nodes via one node
* akka.cluster.down-removal-margin setting
  Margin until shards or singletons that belonged to a
  downed/removed partition are created in surviving partition.
  Used by singleton and sharding.
* remove the retry count parameters/settings for singleton in
  favor of deriving those from the removal-margin
2015-06-18 12:55:54 +02:00
Andrey Myatlyuk
bc791eb86c =doc #17329 Fixed and normalized spellings in ScalaDoc and comments 2015-06-02 21:06:25 -07:00
Patrik Nordwall
8a7d7715b5 clu #17565 Invoke OnMemberRemoved callback when cluster.shutdown

* must also be done when the listener actor stops before the
  MemberRemoved event has been received
* add test for this
* clarify docs with example that shuts down actor system and
  exits jvm
2015-05-27 15:42:53 +02:00
Roland Kuhn
18688fc84b = #17380 fix doc comments for java8 doclint
* actor and cluster-metrics comments
* agent/camel/cluster/osgi/persistence/remote comments
* comments in contrib/persistence-tck/multi-node/typed
2015-05-18 12:51:36 +02:00
Patrik Nordwall
aaa620c35e =clu #17362 Make cluster.joinSeedNodes equivalent to conf seed-nodes
* the difference was in the retry of failed join attempt
* also clarify the documentation
2015-05-13 10:48:18 +02:00
hepin
ccca503b4d +clu #16736 add registerOnMemberRemoved to get notified when current member removed from the cluster 2015-05-08 12:58:12 +08:00