
80 commits

Author SHA1 Message Date
Arnout Engelen
079aa46733 Introduce 'MemberDowned' member event (#25854)
* Introduce 'MemberDowned' member event

Compatibility note: MemberEvent is a sealed trait, so it is debatable whether
it is acceptable to introduce a new member.

* Be more conservative (more like leaving), add test
2018-11-05 10:03:06 +00:00
kerr
fafc59b19d update headers to regular comment (#25807) 2018-10-29 05:19:37 -04:00
kenji yoshida
5b3b191bac Remove procedure syntax (#25362) 2018-07-25 13:38:27 +02:00
Konrad `ktoso` Malawski
563c7fbcf0 Issue 24594: Integration with sbt-headers and initial header population 2018-03-13 15:45:55 +01:00
Christopher Batey
5658d6e77a MultiDcSplitBrain: only subscribe to unreachable after split
Test would fail picking up the reachable from the previous unsplit
as it is a new probe.

Also change barrierCounter to split/unsplit so easier to see
where the failure is on a barrier fail
2018-01-30 09:01:15 +00:00
Christopher Batey
009214ae07 Update copyright to 2018 (#24241) 2018-01-04 17:26:29 +00:00
Patrik Nordwall
5fc6d5a04a Verify removal and add of new node incarnation in multi-dc, #23585
* MemberRemoved must be published before MemberUp, e.g. when restarted
  in other DC
* remove from failureDetector when receiving gossip with new member,
  not only new joining member

* increase timeout in MultiDcSingletonManagerSpec
2017-09-25 16:47:06 +02:00
Johan Andrén
c31f6b862f cluster apis for typed, #21226
* Cluster management (join, leave, etc)
* Cluster membership subscriptions (MemberUp, MemberRemoved, etc)
* New SelfUp and SelfRemoved events
* change signature of awaitAssert to return the value (not binary compatible)
* Cluster singleton api
2017-09-21 17:58:29 +02:00
Patrik Nordwall
e3aada5016 Connect the dots for cross-dc reachability, #23377
* the crossDcFailureDetector was not connected to the reachability table
* additional test by listening for {Reachable/Unreachable}DataCenter events in split spec
* missing Java API for getUnreachableDataCenters in CurrentClusterState
2017-08-22 15:05:40 +02:00
Johan Andrén
cff43a16f7 Data center reachability in cluster state (#23359)
* Manual case-declassing of CurrentClusterState #23347

* Unreachable data centers set in CurrentClusterState #23347
2017-08-22 13:04:39 +02:00
Martynas Mickevičius
73d3c5db5d DC reachability events #23245 2017-07-12 13:48:15 +01:00
Johan Andrén
9c7e8d027a Renamed/moved the self data center setting #23312 (#23344) 2017-07-12 11:47:32 +01:00
Johan Andrén
9f4da87840 =clu #23286 filter emitted reachability event by DC 2017-07-07 16:50:36 +01:00
Johan Andrén
c0d439eac3 limit cross dc gossip #23282 2017-07-07 13:19:10 +01:00
Patrik Nordwall
867cc97bdd Refactoring of Gossip class, #23290
* move methods that depend on selfUniqueAddress and selfDc
  to a separate MembershipState class, which also holds the
  latest gossip
* this removes the need to pass in the parameters from everywhere and
  makes it easier to cache some results
* makes it clear that those parameters are always selfUniqueAddress
  and selfDc, instead of some arbitrary node/dc
2017-07-05 08:47:32 +02:00
Patrik Nordwall
bb9549263e Rename team to data center, #23275 2017-07-04 17:11:21 +02:00
Johan Andrén
164387a89e [WIP] one leader per cluster team (#23239)
* Guarantee no sneaky type puts more teams in the role list

* Leader per team and initial tests

* MiMa filters

* Second iteration (not working though)

* Verbose gossip logging etc.

* Gossip to team-nodes even if there is inter-team unreachability

* More work ...

* Marking removed nodes with tombstones in Gossip

* More test coverage for Gossip.remove

* Bug failing other multi-node tests squashed

* Multi-node test for team-split

* Review fixes - only prune tombstones on leader ticks

* Clean code is happy code.

* All I want is for MiMa to be my friend

* These constants are internal

* Making the formatting gods happy

* I used the wrong reachability for ignoring gossip :/

* Still hadn't quite gotten how reachability was supposed to work

* Review feedback applied

* Cross-team downing should still work

* Actually prune tombstones in the prune tombstones method ...

* Another round against reachability. Reachability leading with 15 - 2 so far.
2017-07-04 10:09:40 +02:00
Patrik Nordwall
41c756f169 properly shutdown ArteryTransport using CoordinatedShutdown, #22671 (#22698)
* properly shutdown ArteryTransport using CoordinatedShutdown, #22671

* The shutdownHook changed hasBeenShutdown flag to true, and then when
  the transport.shutdown was invoked the shutdown sequence was ignored
  until it was too late, ActorSystem already terminated.
* Also improved the cluster shutdown tasks when the cluster node had not
  joined

* CoordinatedShutdownLeave explicit events
2017-04-11 21:48:51 +02:00
Patrik Nordwall
452b3f1406 remove old deprecated cluster metrics, #21423
* the corresponding functionality was moved to akka-cluster-metrics, see
  http://doc.akka.io/docs/akka/2.4/project/migration-guide-2.3.x-2.4.x.html#New_Cluster_Metrics_Extension
2017-01-20 13:48:36 +01:00
Patrik Nordwall
84ade6fdc3 add CoordinatedShutdown, #21537
* CoordinatedShutdown that can run tasks for configured phases in order (DAG)
* coordinate handover/shutdown of singleton with cluster exiting/shutdown
* phase config obj with depends-on list
* integrate graceful leaving of sharding in coordinated shutdown
* add timeout and recover
* add some missing artery ports to tests
* leave via CoordinatedShutdown.run
* optionally exit-jvm in last phase
* run via jvm shutdown hook
* send ExitingConfirmed to leader before shutdown of Exiting
  to not have to wait for failure detector to mark it as
  unreachable before removing
* the unreachable signal is still kept as a safeguard if
  message is lost or leader dies
* PhaseClusterExiting vs MemberExited in ClusterSingletonManager
* terminate ActorSystem when cluster shutdown (via Down)
* add more predefined and custom phases
* reference documentation
* migration guide
* problem when the leader order was sys2, sys1, sys3,
  then sys3 could not perform its duties and move Leaving sys1 to
  Exiting because it was observing sys1 as unreachable
* exclude Leaving with exitingConfirmed from convergence condition
2017-01-16 09:01:57 +01:00
Philippus Baalman
6c7085252a extended copyright into 2017 2017-01-04 17:37:15 +01:00
Patrik Nordwall
446c0545ec member accessor in ReachabilityEvent, #21944 (#21947) 2016-12-05 12:07:18 +01:00
Björn Antonsson
c66ce62d63 Update to a working version of Scalariform 2016-06-02 22:12:36 +02:00
Johan Andrén
62e30b3c08 Update copyrights and links to the new company name #19851 2016-02-23 12:58:39 +01:00
Prayag Verma
b7783968a0 =pro #19068 All copyrights ranges and single years updated to a range ending in 2016 2016-01-25 10:20:30 +01:00
drewhk
48282fc753 Merge pull request #18729 from hseeberger/hseeberger-18575-publish-member-joined
Publish MemberJoined
2015-11-11 11:23:04 +01:00
Patrik Nordwall
c7c187f6b7 =clu replace Set -- with diff and ++ with union
* better performance according to
  https://docs.google.com/presentation/d/1Qjryxoe-fYEM8ZPhM-98LKfbhnRcn5eAEMNlVVnixsA/pub
2015-11-06 14:48:17 +01:00
Heiko Seeberger
821dc2199b +act #18575 Publish MemberJoined 2015-10-21 17:30:28 +02:00
Patrik Nordwall
9380983d3c =clu #18554 Make oldest assignment deterministic when joining
* the reported issue is fixed by the immediate leaderActions
  (moving to Up) when joining the first node to itself
* the other changes are precautions just in case
2015-10-21 07:53:14 +02:00
Veiga Ortiz, Héctor
c08bc317e2 +clu #13584 Accept joining to be WeaklyUp during network split
* experimental feature, disabled by default
* Adding documentation to mention weakly up members.
  plus adding new diagram.
2015-09-04 12:44:47 +02:00
Roland Kuhn
18688fc84b = #17380 fix doc comments for java8 doclint
* actor and cluster-metrics comments
* agent/camel/cluster/osgi/persistence/remote comments
* comments in contrib/persistence-tck/multi-node/typed
2015-05-18 12:51:36 +02:00
Patrik Nordwall
c991d5f1d1 =str #17200 Stop shard region when MemberRemoved
Two issues:

1) ShardRegion actor must stop itself when the node is shutting down,
   ie. when receiving MemberRemoved(selfAddress)
2) ShardCoordinator must not persist anything when the node is shutting
   down. MemberRemoved of other shard regions will trigger Terminated,
   which must not be persisted, because then the next coordinator will
   replay those events and end up in wrong state. This problem
   announced itself when using leaving, as illustrated in the new test.

To solve the second issue I have added a new ClusterShuttingDown event
that is published before the MemberRemoved events. Note that Terminated
is triggered by MemberRemoved.

(cherry picked from commit 1b272c72597beece9d93f0054f4b58e3d25f9ae2)
2015-04-22 12:46:30 +02:00
Patrik Nordwall
fe98dae650 =clu #13875 Fix regression in leader selection
* The leader is selected by picking the first reachable member, but in
  #13875 we had to let the self member be unreachable in the Reachability
  table and that was not considered in the logic of the leader selection.
* That means changed behavior that is unwanted, especially when there
  is only one node left the leader could be evaluated to None instead
  of Some(selfUniqueAddress).
* Note that #13875 has not been released yet.
2015-03-14 11:41:28 -07:00
Julian Tescher
00f6a58e7c Changes all occurrences of Typesafe copyright to extend to 2015 2015-03-10 14:12:19 -07:00
Patrik Nordwall
71ccb4c21b =clu #13875 Exclude unreachability observations from downed
* Skip observations from downed node (quarantined is marked down immediately)
  in convergence check
* Skip observations from downed node when picking "reachable" targets for gossip.
* This also means that we must accept gossip with own node marked as unreachable,
  but that should not be spread to the external membership events.
2015-02-06 10:19:48 +01:00
Andrei Pozolotin
7b9f77a073 + akka-cluster-metrics: new akka module
* new akka module split from akka-cluster
* provide sigar provisioning
* fix ewma usage
* resolve #16121
* see #16354
2015-01-19 10:23:54 -06:00
Patrik Nordwall
503c4ced8f !clu #3920 Remove deprecated Cluster.publishCurrentClusterState 2014-03-14 14:11:28 +01:00
dario.rexin
2cbad298d6 =all #3858 Make case classes final 2014-03-07 13:20:01 +01:00
Adam Voss
cce29dfa51 Changes all occurrences of Typesafe copyright to extend to 2014. 2014-02-04 21:20:09 -06:00
Patrik Nordwall
2e5193347e !clu #3617 API improvements related to CurrentClusterState
* Getter for CurrentClusterState in Cluster extension, updated via
  ClusterReadView
* Remove lazy init of readView. Otherwise the cluster.state will be
  empty on first access, which is probably surprising
* Subscribe to several cluster event types at once, to ensure *one*
  CurrentClusterEvent followed by change events
* Deprecate publishCurrentClusterState, was a bad idea, use sendCurrentClusterState
  instead
* Possibility to subscribe with InitialStateAsEvents to receive events corresponding
  to CurrentClusterState
* CurrentClusterState not a ClusterDomainEvent, ticket #3614
2014-01-16 16:17:44 +01:00
Patrik Nordwall
dc9fe4f19c !clu #2307 Allow transition from unreachable to reachable
* Replace unreachable Set with Reachability table
* Unreachable members stay in member Set
* Downing a live member moved it to the unreachable Set,
  and then it was removed from there by the leader. That will not
  work when flipping back to reachable, so a Down member must
  be detected as unreachable before being removed. Similar
  to Exiting. Member shuts down itself if it sees itself as
  Down.
* Flip back to reachable when failure detector monitors it as
  available again
* ReachableMember event
* Can't ignore gossip from aggregated unreachable (see SurviveNetworkInstabilitySpec)
* Make use of ReachableMember event in cluster router
* End heartbeat when acknowledged, EndHeartbeatAck
* Remove nr-of-end-heartbeats from conf
* Full reachability info in JMX cluster status
* Don't use interval after unreachable for AccrualFailureDetector history
* Add QuarantinedEvent to remoting, used for Reachability.Terminated
* Prune reachability table when all reachable
* Update documentation
* Performance testing and optimizations
2013-09-11 13:10:29 +02:00
Patrik Nordwall
a323936299 Disable cluster stats by default, see #3348
* Add VectorClockStats
2013-05-28 16:15:57 +02:00
Patrik Nordwall
ee6e80d31a Add previousStatus in MemberRemoved, see #3252 2013-05-23 11:09:32 +02:00
Patrik Nordwall
a0a0f39613 Hardening of cluster member leaving path, see #3309
* Removed leader commands for Shutdown and Exit
* Member shuts down itself when it sees itself as Exiting
* Singleton cluster with status Exiting will shut down itself,
  in case the Exiting gossip never arrives
* Exiting member not part of convergence check
* Exiting member is removed by leader (on convergence) when the
  exiting member is in the unreachable set, i.e. successfully shut down
* Reverted the change made for #3266, i.e. Exiting is
  detected as unreachable again.
* Adjust ClusterSingletonManager to new Exiting behaviour
* Fix bug in HeartbeatSender, which caused it to continue to
  send heartbeats to removed nodes, instead of rebalancing
* Refactoring of leaderActions method
* Leaving section in docs
2013-05-17 11:39:49 +02:00
Björn Antonsson
539df2e98a Enforce mailbox types on System actors. See #3273 2013-05-03 11:05:32 +02:00
Patrik Nordwall
4606612bd1 Reliable remote supervision and death watch, see #2993
* RemoteWatcher that monitors node failures, with heartbeats
  and failure detector
* Move RemoteDeploymentWatcher from CARP to RARP
* ClusterRemoteWatcher that handles cluster nodes
* Update documentation
* UID in Heartbeat msg to be able to quarantine,
  actual implementation of quarantining will be implemented
  in ticket 2594
2013-04-17 19:42:51 +02:00
Patrik Nordwall
9e56ab6fe5 Disallow re-joining, see #2873
* Disallow join requests when already part of a cluster
* Remove wipe state when joining, since join can only be
  performed from empty state
* When trying to join, only accept gossip from that member
* Ignore gossips from unknown (and unreachable) members
* Make sure received gossip contains selfAddress
* Test join of fresh node with same host:port
* Remove JoinTwoClustersSpec
* Welcome message as reply to Join
* Retry unsuccessful join request
* AddressUidExtension
* Uid in cluster Member identifier
  To be able to distinguish nodes with same host:port
  after restart.
* Ignore gossip with wrong uid
* Renamed Remove command to Shutdown
* Use uid in vclock identifier
* Update sample, Member apply is private
* Disabled config duration syntax and cleanup of io settings
* Update documentation
2013-04-17 16:48:18 +02:00
Björn Antonsson
73f0f44ddb Protobuf serialization of cluster messages. See #1910 2013-04-11 10:09:05 +02:00
Patrik Nordwall
7eac88f372 Cluster node roles, see #3049
* Config of node roles cluster.role
* Cluster router configurable with use-role
* RoleLeaderChanged event
* Cluster singleton per role
* Cluster only starts once all required per-role node
  counts are reached,
  role.<role-name>.min-nr-of-members config
* Update documentation and make use of the roles in the examples
2013-03-18 11:56:11 +01:00
Patrik Nordwall
1e4b2585c7 Publish LeaderChanged when first seen, see #3131
* The problem in ClusterSingletonManagerChaosSpec was that node 4 doesn't publish
  LeaderChanged, because there is never convergence on node 4 of the new Up
  state for the three new nodes before they are shut down. When convergence
  is reached on node 4, prevConvergedGossip and newGossip have the same leader
  (i.e. no change).
* LeaderChanged is now published when the new leader is first seen, i.e. same
  as member events. This makes sense now when leader can't be in Joining state.
2013-03-11 12:41:15 +01:00