627 commits

Author SHA1 Message Date
Patrik Nordwall
84ade6fdc3 add CoordinatedShutdown, #21537
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase config obj with depends-on list
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
* send ExitingConfirmed to leader before shutdown of Exiting
  to not have to wait for failure detector to mark it as
  unreachable before removing
* the unreachable signal is still kept as a safeguard if
  the message is lost or the leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate ActorSystem when cluster shutdown (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
* problem when the leader order was sys2, sys1, sys3,
  then sys3 could not perform its duties and move Leaving sys1 to
  Exiting because it was observing sys1 as unreachable
* exclude Leaving with exitingConfirmed from convergence condition
2017-01-16 09:01:57 +01:00
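The "phases in order (DAG)" idea from the commit above can be illustrated with a minimal sketch. This is not Akka's implementation: the `phase_order` helper and the phase names in the example are assumptions for illustration only. It orders phases by their `depends-on` lists with a topological sort from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def phase_order(phases):
    """Return phase names so that each phase follows all phases it depends on.

    `phases` maps a phase name to a dict with an optional "depends-on" list.
    A cyclic configuration raises graphlib.CycleError.
    """
    # TopologicalSorter expects {node: set_of_predecessors}.
    graph = {name: set(cfg.get("depends-on", [])) for name, cfg in phases.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical phase configuration, declared out of order on purpose.
phases = {
    "actor-system-terminate": {"depends-on": ["cluster-exiting"]},
    "cluster-exiting": {"depends-on": ["cluster-leave"]},
    "service-stop": {},
    "cluster-leave": {"depends-on": ["service-stop"]},
}

order = phase_order(phases)
```

Because each phase in this example depends on exactly one predecessor, the resulting order is a single chain starting at `service-stop`.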
Philippus Baalman
6c7085252a extended copyright into 2017 2017-01-04 17:37:15 +01:00
Patrik Nordwall
dce668771e fix shutdown of pending StressSpec, #21960 (#21963) 2016-12-07 15:38:11 +01:00
Patrik Nordwall
2ef6457311 enable NodeChurnSpec, #21483
* Verify that it actually fails with classic remoting
  if vector clocks are not pruned
* Make it pass with Artery, but it is not verifying
  the message sizes yet. We should implement that
  with a custom RemoteInstrument, but that can be done
  in a separate PR.
* Still pending with Artery because it still fails on Jenkins
* barrier after sys shutdown

(cherry picked from commit d5edcbea35ca5b43ca4cfb3018602dd555402f42)
2016-12-05 14:27:12 +01:00
Patrik Nordwall
e04444567f Speedup pull request validation
* speedup ActorCreationPerfSpec
* reduce iterations in ConsistencySpec
* tag SupervisorHierarchySpec as LongRunningTest
* various small speedups and tagging in actor-tests
* speedup expectNoMsg in stream-tests
* tag FramingSpec, and reduce iterations
* speedup QueueSourceSpec
* tag some stream-tests
* reduce iterations in persistence.PerformanceSpec
* reduce iterations in some cluster perf tests
* tag RemoteWatcherSpec
* tag InterpreterStressSpec
* remove LongRunning from ClusterConsistentHashingRouterSpec
* sys property to disable multi-jvm tests in test
* actually disable multi-node tests in validatePullRequest
* doc sbt flags in CONTRIBUTING
2016-11-30 14:31:06 +01:00
Johan Andrén
2679be5ae4 Disable serialization warnings in akka test suites #21882 2016-11-23 12:02:36 +01:00
Patrik Nordwall
cc170df4d2 mark StressSpec pending for Artery until we fix it, #21810 2016-11-18 13:06:33 +01:00
Patrik Nordwall
86d912a299 Merge pull request #21555 from akka/wip-21522-StressSpec-patriknw
increase acceptable-heartbeat-pause in StressSpec, #21522
2016-09-26 19:21:07 +02:00
Johan Andrén
8ae0c9a888 Use long uid in artery remoting and cluster #20644 2016-09-26 15:34:59 +02:00
Patrik Nordwall
d91ddb7891 increase acceptable-heartbeat-pause in StressSpec, #21522 2016-09-23 15:50:32 +02:00
Patrik Nordwall
9f175f56de fix problem with quick restart, #21512
* image-liveness-timeout must be less than the handshake-timeout,
  otherwise the publication for the handshake will give up too early
  when previous image is still considered alive
2016-09-21 20:27:04 +02:00
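The constraint described in the commit above can be expressed as a simple configuration check. This is a sketch only: the function name, the parameters, and the example values are assumptions and do not correspond to Akka code or defaults.

```python
def check_timeouts(image_liveness_timeout_s, handshake_timeout_s):
    """Reject settings where the image liveness timeout is not below the handshake timeout.

    Both arguments are durations in seconds (hypothetical parameters).
    """
    if image_liveness_timeout_s >= handshake_timeout_s:
        raise ValueError(
            "image-liveness-timeout must be less than handshake-timeout, "
            "otherwise the handshake may give up while the previous image "
            "is still considered alive"
        )


# Example values are illustrative only.
check_timeouts(image_liveness_timeout_s=10, handshake_timeout_s=20)
```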
Endre Sándor Varga
8ecd7419ac #21419: Reenable ClusterDeathWatchSpec 2016-09-19 12:48:07 +02:00
Johan Andrén
392ca5ecce Enable flight recorder in tests #21205
* Setting to configure where the flight recorder puts its file
* Run ArteryMultiNodeSpecs with flight recorder enabled
* More cleanup in exit hook, wait for task runner to stop
* Enable flight recorder for the cluster multi node tests
* Enable flight recorder for multi node remoting tests
* Toggle always-dump flight recorder output when akka.remote.artery.always-dump-flight-recorder is set
2016-09-16 15:12:40 +02:00
Patrik Nordwall
835125de3d make cluster.StressSpec pass with Artery, #21458
* need to use a shared media driver to get the cpu usage
  at a reasonable level
* also changed to SleepingIdleStrategy(1 ms) when cpu-level=1;
  not needed for the test to pass, but can be good to make level 1
  more extreme
2016-09-16 12:58:41 +02:00
Patrik Nordwall
bf151e9793 don't quarantine back, #21450
* Don't quarantine the other system when receiving the Quarantined message,
  since that will result in cluster member removal and can result in
  forming two separate clusters (cluster split).
* Instead, the downing strategy should act on ThisActorSystemQuarantinedEvent, e.g.
  use it as a STONITH signal.
2016-09-13 08:01:58 +02:00
Johan Andrén
3502f0d72f One more missed canonical.port in cluster tests (#21428) 2016-09-09 18:12:35 +02:00
Johan Andrén
b0e03058b9 Port and hostname config path was changed, cluster tests didn't get the change (#21427) 2016-09-09 17:55:02 +02:00
Johan Andrén
fa1d6d6f19 Disable ClusterDeathWatchSpec for now (#21421) 2016-09-09 17:54:13 +02:00
Johan Andrén
90193907fe Make cluster tests run with artery #21204 2016-09-07 16:41:03 +02:00
Patrik Nordwall
8ab02738b7 Merge branch 'master' into wip-sync-artery-dev-2.4.9-patriknw 2016-08-23 20:14:15 +02:00
Patrik Nordwall
0aca351d81 harden SurviveNetworkInstabilitySpec #18767 2016-08-23 17:51:57 +02:00
Patrik Nordwall
483d46ddd0 harden NodeChurnSpec, #21053 2016-08-08 17:50:24 +02:00
Johan Andrén
d6c048f59a A simpler ActorRefProvider config #20649 (#20767)
* Provide shorter aliases for the ActorRefProviders #20649
* Use the new ActorRefProvider aliases throughout code and docs
* Cleaner alias replacement logic
2016-06-10 15:04:13 +02:00
Johan Andrén
896ea53dd3 recovery timeout for persistent actors #20698 2016-06-03 14:17:41 +02:00
Patrik Nordwall
3465a221f0 format with new Scalariform version
* and fix mima issue
2016-06-03 12:56:49 +02:00
Patrik Nordwall
839ec5f167 Merge branch 'master' into wip-sync-artery-patriknw 2016-06-03 11:09:17 +02:00
Patrik Nordwall
c15e04e051 Merge pull request #20700 from akka/wip-20639-restarting-node2-patriknw
test for restarting node, #20639
2016-06-03 09:27:15 +02:00
Björn Antonsson
c66ce62d63 Update to a working version of Scalariform 2016-06-02 22:12:36 +02:00
Patrik Nordwall
91c8e90f82 test for restarting node, #20639 2016-06-02 12:52:55 +02:00
Johan Andrén
5e3eb4bd8c Auto port selection and SunnyWeatherSpec for Artery (#20512)
* Automatic port selection when port 0 configured
* Combine remoting and artery SunnyWeatherSpec
* Default to port 0 for artery in MultiNodeSpec.nodeConfig
2016-05-17 14:17:21 +02:00
Patrik Nordwall
0ec6bd35da fix wrong setting in AdaptiveLoadBalancingRouterSpec, #18156 2016-03-22 15:31:27 +01:00
Patrik Nordwall
12db887ebb Merge pull request #20106 from akka/wip-19536-NodeChurnSpec-patriknw
harden cluster.NodeChurnSpec, #19536
2016-03-22 15:12:14 +01:00
Patrik Nordwall
3e7cd4d98c Merge pull request #20093 from akka/wip-19780-ack-takeover-patriknw
rem #19780: Skip acks during connection handoff
2016-03-22 14:01:29 +01:00
Patrik Nordwall
86d6b91846 harden cluster.NodeChurnSpec, #19536
* increasing the awaitAssert timeout since logs show that things are
  "still in progress" when leaving
2016-03-22 08:38:43 +01:00
Patrik Nordwall
b28933967f harden cluster.StressSpec, #20095
* logs show unexpected unreachable (and downing) during the high
  throughput test
* increasing the failure-detector timeout
2016-03-22 08:35:08 +01:00
Patrik Nordwall
96b68f6437 rem #19780: Skip acks during connection handoff
* The problem: an ACK that was targeted to an old incarnation
  was sent to the new, restarted system with the same host:port,
  resulting in issues noticed as
  "Error encountered while processing system message acknowledgement buffer: [-1 {}] ack: ACK[0, {}]"
  when restarting the actor system

* The reason:

  1. The endpoint reader was about to send OutgoingAck to parent reader,
     targeted to the old system.
  2. At the same time there is an incoming connection from new system
     that triggered TakeOver in the endpoint writer, i.e. replacing
     the handle to the connection of the new system.
  3. The OutgoingAck is received by the writer, which happily sends it
     to the new handle, the new system.

* The solution: Ignore OutgoingAck during the handoff (TakeOver) process.
2016-03-21 08:55:19 +01:00
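The race and its fix described in the commit above can be modelled with a toy state machine. This is a sketch under stated assumptions, not the Akka endpoint writer: the class, its methods, and the use of plain lists as connection handles are all hypothetical.

```python
class EndpointWriterSketch:
    """Toy writer that forwards acks over its current connection handle."""

    def __init__(self, handle):
        self.handle = handle      # a list standing in for a connection
        self.in_handoff = False

    def begin_takeover(self, new_handle):
        # An incoming connection from the restarted system replaces the handle.
        self.handle = new_handle
        self.in_handoff = True

    def finish_takeover(self):
        self.in_handoff = False

    def on_outgoing_ack(self, ack):
        # An ack arriving during the handoff may target the old incarnation,
        # so it is ignored instead of being sent over the new handle.
        if self.in_handoff:
            return False
        self.handle.append(ack)
        return True


old_system, new_system = [], []
writer = EndpointWriterSketch(old_system)
sent_before = writer.on_outgoing_ack("ack-for-old")
writer.begin_takeover(new_system)
sent_during = writer.on_outgoing_ack("stale-ack-for-old")
writer.finish_takeover()
sent_after = writer.on_outgoing_ack("ack-for-new")
```

In this model the stale ack never reaches `new_system`, which mirrors the intent of ignoring OutgoingAck during TakeOver.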
Johan Andrén
62e30b3c08 Update copyrights and links to the new company name #19851 2016-02-23 12:58:39 +01:00
Prayag Verma
b7783968a0 =pro #19068 All copyrights ranges and single years updated to a range ending in 2016 2016-01-25 10:20:30 +01:00
Patrik Nordwall
4d64901228 =clu #19274 failure detection of joining/down member status
* Failure detection heartbeating was not performed to joining
  nodes, since it was expected that they will become Up first.
* If a joining node is downed before it is changed to Up, failure
  detection will not be performed for that node. As a result,
  the downed node will not be removed from membership, since the
  unreachability signal is used as confirmation that the node is
  actually stopped before removing it.
2015-12-26 11:30:18 +01:00
drewhk
48282fc753 Merge pull request #18729 from hseeberger/hseeberger-18575-publish-member-joined
Publish MemberJoined
2015-11-11 11:23:04 +01:00
Roland Kuhn
f1abaa1c5e Merge pull request #18875 from ktoso/wip-akka.js-cherries-ktoso
Akka.js cherries to master
2015-11-07 18:01:24 +01:00
Patrik Nordwall
1e36e5e187 Merge pull request #18746 from akka/wip-18554-singleton-startup-patriknw
=clu #18554 Make oldest assignment deterministic when joining
2015-11-06 14:48:57 +01:00
Patrik Nordwall
c7c187f6b7 =clu replace Set -- with diff and ++ with union
* better performance according to
  https://docs.google.com/presentation/d/1Qjryxoe-fYEM8ZPhM-98LKfbhnRcn5eAEMNlVVnixsA/pub
2015-11-06 14:48:17 +01:00
Andrea
cd3d68a77c =act switch to java std lib ThreadLocalRandom 2015-11-06 14:04:33 +01:00
Kailuo Wang
90cba9ce0d +act #18356 Metrics based resizer for router 2015-10-22 11:14:00 -04:00
Heiko Seeberger
821dc2199b +act #18575 Publish MemberJoined 2015-10-21 17:30:28 +02:00
Patrik Nordwall
9380983d3c =clu #18554 Make oldest assignment deterministic when joining
* the reported issue is fixed by the immediate leaderActions
  (moving to Up) when joining the first node to itself
* the other changes are precautions just in case
2015-10-21 07:53:14 +02:00
Patrik Nordwall
94896e8e75 =rem #18339 Use explicit handshake timeout
* instead of using transport failure detector
* add a new config property akka.remote.handshake-timeout, but
  for netty.tcp and netty.ssl the existing netty.tcp.connection-timeout
  setting will be used
* add test of the timeouts
* mima filter for internal ProtocolStateActor
2015-10-19 14:34:52 +02:00
Veiga Ortiz, Héctor
c08bc317e2 +clu #13584 Accept joining to be WeaklyUp during network split
* experimental feature, disabled by default
* Add documentation mentioning weakly up members,
  plus a new diagram.
2015-09-04 12:44:47 +02:00
Patrik Nordwall
2566c7a2b6 =clu #18156 Harden AdaptiveLoadBalancingRouterSpec 2015-08-26 12:13:50 +02:00