1389 commits

Author SHA1 Message Date
Johannes Rudolph
0251886111 =clu add comments for Reachability methods 2017-07-04 12:58:28 +02:00
Johan Andrén
164387a89e [WIP] one leader per cluster team (#23239)
* Guarantee no sneaky type puts more teams in the role list

* Leader per team and initial tests

* MiMa filters

* Second iteration (not working though)

* Verbose gossip logging etc.

* Gossip to team-nodes even if there is inter-team unreachability

* More work ...

* Marking removed nodes with tombstones in Gossip

* More test coverage for Gossip.remove

* Bug failing other multi-node tests squashed

* Multi-node test for team-split

* Review fixes - only prune tombstones on leader ticks

* Clean code is happy code.

* All I want is for MiMa to be my friend

* These constants are internal

* Making the formatting gods happy

* I used the wrong reachability for ignoring gossip :/

* Still hadn't quite gotten how reachability was supposed to work

* Review feedback applied

* Cross-team downing should still work

* Actually prune tombstones in the prune tombstones method ...

* Another round against reachability. Reachability leading with 15 - 2 so far.
2017-07-04 10:09:40 +02:00
Arnout Engelen
0115d5fdda Less abbreviations, more reliable test
(cherry picked from commit 61e289b276f410654c1b063c33648e0d7ea88e50)
2017-07-03 10:47:21 +02:00
Arnout Engelen
2f11ec6f25 Introduce cluster 'team' setting and add to Member
Introduced cluster-team.md so we can grow the documentation with each
PR, but did not add it to the ToC yet.

(cherry picked from commit a06badaa03fa9f3c9a942b1468090f758c74a869)
2017-07-03 10:47:14 +02:00
Patrik Nordwall
a7dc938188 Revert "Introduce cluster 'team' setting and add to Member"
This reverts commit a06badaa03fa9f3c9a942b1468090f758c74a869.
2017-07-03 10:44:36 +02:00
Patrik Nordwall
bd6afb8952 Revert "Less abbreviations, more reliable test"
This reverts commit 61e289b276f410654c1b063c33648e0d7ea88e50.
2017-07-03 10:44:24 +02:00
Arnout Engelen
9f78cd12c4 Introduce cluster 'team' setting and add to Member (#23234)
* Introduce cluster 'team' setting and add to Member

Introduced cluster-team.md so we can grow the documentation with each
PR, but did not add it to the ToC yet.

* Less abbreviations, more reliable test
2017-06-26 16:28:06 +02:00
Patrik Nordwall
edef9e34c7 serialize-creators=off in tests, #23003 2017-05-22 20:11:03 +02:00
Nafer Sanabria
ef76af7add =cls add logging info on seed node joining (#22724)
* =cls add logging info on seed node joining

* adjust message
2017-05-19 14:20:29 +02:00
Philippus Baalman
ef9c7313b6 Extend copyright into 2017 (#22833) 2017-05-04 15:14:33 +02:00
Patrik Nordwall
8e57304c7d update to Aeron 1.2.5, and fix the SharedMediaDriverSupport 2017-04-18 15:16:01 +02:00
Patrik Nordwall
3b53daa370 Revert "update to Aeron 1.2.4, and fix the SharedMediaDriverSupport, #22693"
This reverts commit 3d0d50e98b.
2017-04-12 07:38:02 +02:00
Patrik Nordwall
41c756f169 properly shutdown ArteryTransport using CoordinatedShutdown, #22671 (#22698)
* properly shutdown ArteryTransport using CoordinatedShutdown, #22671

* The shutdownHook changed hasBeenShutdown flag to true, and then when
  the transport.shutdown was invoked the shutdown sequence was ignored
  until it was too late, ActorSystem already terminated.
* Also improved the cluster shutdown tasks when the cluster node had not
  joined

* CoordinatedShutdownLeave explicit events
2017-04-11 21:48:51 +02:00
Patrik Nordwall
3d0d50e98b update to Aeron 1.2.4, and fix the SharedMediaDriverSupport, #22693
* SharedMediaDriverSupport failed with NPE with Aeron 1.2.4, and
  concludeAeronDirectory solves that
2017-04-11 18:30:18 +02:00
Hawstein
6434cbe868 Re-implement javadsl testkit (#22240)
* re-implement javadsl testkit

* fix mima problem

* rebase master

* move ImplicitSender/DefaultTimeout to scaladsl

* undo the change of moving scala api

* fix return type and add doc

* resolve conflicts and add more comments
2017-03-16 20:02:47 +01:00
Johan Andrén
3643f18ded Protobuf serializers for remote deployment #22332 2017-03-16 15:12:35 +01:00
Richard Imaoka
cc1312922c Allow multiple Cluster JMX MBeans in the same JVM (#22484)
* Allow multiple Cluster JMX MBeans in the same JVM (#18772)

* Remove unnecessary whitespace
2017-03-14 14:31:58 +01:00
Devis Lucato
b89008bdaf Fix "attmpts" typo 2017-03-01 12:44:32 +01:00
Martynas Mickevičius
1754625202 #22353 fix mbean expected json format 2017-02-21 13:05:36 +02:00
Richard Imaoka
6936c09e4e Fix JSON formatting of the jmx-cluster/akka-cluster tool #21250 2017-02-20 14:55:43 +01:00
Johan Andrén
52a20f2ba9 Micro kernel module removed #22205 2017-01-26 15:40:54 +01:00
Patrik Nordwall
4703e30774 disable weakly-up for some tests 2017-01-25 07:20:24 +01:00
Patrik Nordwall
94e40460a4 Merge pull request #22206 from akka/wip-21423-remove-deprecations-patriknw
remove deprecations, #21423
2017-01-24 16:45:31 +01:00
Patrik Nordwall
db74c33130 remove deprecated constructor in serializers, #21423 2017-01-24 13:34:05 +01:00
Patrik Nordwall
1700cdaebc Promote WeaklyUp and enable by default, #22197 2017-01-24 12:31:32 +01:00
Patrik Nordwall
af142f82fd change router type in cluster.StressSpec
* it was an oversight when old cluster metrics was removed
2017-01-23 21:18:25 +01:00
Patrik Nordwall
452b3f1406 remove old deprecated cluster metrics, #21423
* corresponding was moved to akka-cluster-metrics, see
  http://doc.akka.io/docs/akka/2.4/project/migration-guide-2.3.x-2.4.x.html#New_Cluster_Metrics_Extension
2017-01-20 13:48:36 +01:00
Patrik Nordwall
6c8a69109a Merge pull request #22138 from VEINHORN/master
Remove unnecessary new keywords
2017-01-17 19:31:45 +01:00
Patrik Nordwall
84ade6fdc3 add CoordinatedShutdown, #21537
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase config obj with depends-on list
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
* send ExitingConfirmed to leader before shutdown of Exiting
  to not have to wait for failure detector to mark it as
  unreachable before removing
* the unreachable signal is still kept as a safeguard if
  message is lost or leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate ActorSystem when cluster shutdown (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
* problem when the leader order was sys2, sys1, sys3,
  then sys3 could not perform its duties and move Leaving sys1 to
  Exiting because it was observing sys1 as unreachable
* exclude Leaving with exitingConfirmed from convergence condition
2017-01-16 09:01:57 +01:00
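
The commit above describes running tasks for configured phases in dependency order (a DAG), with each phase carrying a depends-on list. A minimal, language-neutral sketch of that idea follows. It is not Akka's implementation: the phase names, the `PHASES` table, and the `ordered_phases` and `run_all` helpers are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical phase table: phase name -> list of phases it depends on.
PHASES = {
    "leave": [],
    "exiting": ["leave"],
    "shutdown": ["exiting"],
    "terminate": ["shutdown"],
}


def ordered_phases(phases):
    """Return phase names so that every phase comes after its dependencies."""
    # TopologicalSorter expects a mapping of node -> predecessors and raises
    # graphlib.CycleError if the depends-on lists form a cycle.
    return list(TopologicalSorter(phases).static_order())


def run_all(phases, tasks):
    """Run the registered tasks phase by phase and collect their results."""
    results = []
    for phase in ordered_phases(phases):
        for task in tasks.get(phase, []):
            results.append(task())
    return results


# Tasks are registered per phase; registration order does not affect phase order.
TASKS = {
    "terminate": [lambda: "terminated"],
    "leave": [lambda: "left"],
}
```

Calling `run_all(PHASES, TASKS)` runs the `leave` task before the `terminate` task even though they were registered in the opposite order, because ordering comes from the depends-on lists alone.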
VEINHORN
0eac4d413b removed unnecessary new keywords 2017-01-13 12:35:05 +03:00
Patrik Nordwall
180361868c Merge pull request #22054 from akka/wip-22053-log-join-retry-patriknw
log join retries, #22053
2017-01-09 14:18:40 +01:00
Philippus Baalman
6c7085252a extended copyright into 2017 2017-01-04 17:37:15 +01:00
Patrik Nordwall
645ae4cb31 log join retries, #22053 2016-12-21 16:15:56 +01:00
Patrik Nordwall
e494ec2183 catch NotSerializableException from deserialization, #20641
* to be able to introduce new messages and still support rolling upgrades,
  i.e. a cluster of mixed versions
* note that it's only catching NotSerializableException, which we already
  use for unknown serializer ids and class manifests
* note that it is not catching for system messages, since that could result
  in infinite resending
2016-12-16 20:14:37 +01:00
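
The commit above catches a single, specific deserialization error so that a node can ignore message types it does not know yet, while letting the error propagate for system messages. A rough sketch of that split follows, under stated assumptions: `NotSerializable`, `DECODERS`, `deserialize`, and `receive` are hypothetical stand-ins and do not mirror Akka's actual classes or method signatures.

```python
class NotSerializable(Exception):
    """Hypothetical stand-in for the 'not serializable' error in the commit."""


# Hypothetical registry: class manifest -> decoder for that message type.
DECODERS = {"greeting-v1": lambda payload: payload.decode("utf-8")}


def deserialize(manifest, payload):
    """Decode a payload, raising NotSerializable for an unknown manifest."""
    decoder = DECODERS.get(manifest)
    if decoder is None:
        raise NotSerializable(f"unknown manifest: {manifest}")
    return decoder(payload)


def receive(manifest, payload, is_system_message=False):
    """Return the decoded message, or None if an unknown ordinary message is dropped."""
    if is_system_message:
        # Deliberately no catch: the error propagates for system messages.
        return deserialize(manifest, payload)
    try:
        return deserialize(manifest, payload)
    except NotSerializable:
        # Only this one error type is swallowed, so a newer node can send
        # message types an older node does not understand yet.
        return None
```

Other exceptions raised by a decoder are not caught in this sketch, which matches the commit's note that only the one exception type is handled.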
Patrik Nordwall
1a12e950ff Reachability.remove didn't always remove all, #22012
* the versions table in Reachability was not cleared
  if the records for removed node had been pruned, i.e.
  all reachable again
2016-12-16 12:25:37 +01:00
Patrik Nordwall
f6a1fba824 =clu don't use Down member as leader, #21906 (#21990)
* in the failed test it was noticed that a Down member removed
  itself in leaderActionsOnConvergence which resulted in
  later "Failed to serialize Gossip, Unknown address"
* never use member with status Down as leader
* a node will shut itself down anyway when it's Down,
  but leader actions could happen before that
2016-12-13 10:53:39 +01:00
Patrik Nordwall
dce668771e fix shutdown of pending StressSpec, #21960 (#21963) 2016-12-07 15:38:11 +01:00
Patrik Nordwall
2ef6457311 enable NodeChurnSpec, #21483
* Verify that it actually fails with classic remoting
  if vector clocks are not pruned
* Make it pass with Artery, but it is not verifying
  the message sizes yet. We should implement that
  with a custom RemoteInstrument, but that can be done
  in separate PR.
* Still pending with Artery because it still fails on jenkins
* barrier after sys shutdown

(cherry picked from commit d5edcbea35ca5b43ca4cfb3018602dd555402f42)
2016-12-05 14:27:12 +01:00
Patrik Nordwall
446c0545ec member accessor in ReachabilityEvent, #21944 (#21947) 2016-12-05 12:07:18 +01:00
Patrik Nordwall
e04444567f Speedup pull request validation
* speedup ActorCreationPerfSpec
* reduce iterations in ConsistencySpec
* tag SupervisorHierarchySpec as LongRunningTest
* various small speedups and tagging in actor-tests
* speedup expectNoMsg in stream-tests
* tag FramingSpec, and reduce iterations
* speedup QueueSourceSpec
* tag some stream-tests
* reduce iterations in persistence.PerformanceSpec
* reduce iterations in some cluster perf tests
* tag RemoteWatcherSpec
* tag InterpreterStressSpec
* remove LongRunning from ClusterConsistentHashingRouterSpec
* sys property to disable multi-jvm tests in test
* actually disable multi-node tests in validatePullRequest
* doc sbt flags in CONTRIBUTING
2016-11-30 14:31:06 +01:00
Johan Andrén
2679be5ae4 Disable serialization warnings in akka test suites #21882 2016-11-23 12:02:36 +01:00
Patrik Nordwall
e101fe1232 Merge pull request #21869 from akka/wip-21810-pending-patriknw
mark StressSpec pending for Artery until we fix it, #21810
2016-11-18 15:44:49 +01:00
Patrik Nordwall
cc170df4d2 mark StressSpec pending for Artery until we fix it, #21810 2016-11-18 13:06:33 +01:00
Patrik Nordwall
68383b5001 harden cluster leaving, #21847
As documented in the code:

// Leader is moving itself from Leaving to Exiting. Let others know (best effort)
// before shutdown. Otherwise they will not see the Exiting state change
// and there will not be convergence until they have detected this node as
// unreachable and the required downing has finished. They will still need to detect
// unreachable, but Exiting unreachable will be removed without downing, i.e.
// normally the leaving of a leader will be graceful without the need
// for downing. However, if those final gossip messages never arrive it is
// alright to require the downing, because that is probably caused by a
// network failure anyway.

That is fine, but this change improves the selection of the nodes to
send the final gossip messages to.

I could reproduce the failure in ClusterSingletonManagerLeaveSpec and with
additional logging I verified that in the failure case it picked the "first"
node 3 times (it's random) and that node had already been shut down (left earlier
in the test) but was not removed yet.
2016-11-18 12:33:42 +01:00
Patrik Nordwall
136e64b253 use longUid in ClusterRemoteWatcher, #21594
* found by test failure in SurviveNetworkInstabilitySpec
2016-09-30 10:51:51 +02:00
Johan Andrén
0f376e751e Quarantine gracefully downed node after some time (#21534)
* New setting for quarantining after graceful leave
2016-09-28 14:04:58 +02:00
Patrik Nordwall
86d912a299 Merge pull request #21555 from akka/wip-21522-StressSpec-patriknw
increase acceptable-heartbeat-pause in StressSpec, #21522
2016-09-26 19:21:07 +02:00
Johan Andrén
8ae0c9a888 Use long uid in artery remoting and cluster #20644 2016-09-26 15:34:59 +02:00
Patrik Nordwall
d91ddb7891 increase acceptable-heartbeat-pause in StressSpec, #21522 2016-09-23 15:50:32 +02:00
Patrik Nordwall
63917c1947 Merge pull request #21513 from akka/wip-21512-quick-restart-patriknw
fix problem with quick restart, #21512
2016-09-22 18:33:22 +02:00