* when SBR downs the reachable side (the minority) it's important to quickly inform everybody to shut down
* send gossip directly to downed node, STONITH signal
* gossip to a few random nodes immediately when self is downed, since self is always the last one downed by the SBR
* enable gossip speedup when there are downed members
* adjust StressSpec to normal again
* adjust TransitionSpec to the new behavior
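For orientation, a minimal config sketch for enabling the SBR that makes these downing decisions (assuming the open-source SBR shipped with Akka Cluster; values are illustrative, and the gossip speedup itself is internal behavior, not a setting):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Enable the Split Brain Resolver whose downing decisions trigger the gossip
// changes described above. Strategy and timing values are examples only.
val sbrConfig = ConfigFactory.parseString("""
  akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
  akka.cluster.split-brain-resolver.active-strategy = keep-majority
  akka.cluster.split-brain-resolver.stable-after = 20s
  """)

val system = ActorSystem("ClusterSystem", sbrConfig.withFallback(ConfigFactory.load()))
```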
* Config for when to move to WeaklyUp
* noticed when I was testing with the StressSpec that it often moves nodes to WeaklyUp in normal joining scenarios (also seen in Kubernetes testing)
* better to wait somewhat longer, since WeaklyUp requires a new convergence round, making the full joining -> up transition take longer
* changed the existing config property to be a duration
* default 7s, previously it was 3s
* on => 7s (the previous `on` value now maps to the 7s default)
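A small sketch of the changed setting, assuming the property in question is akka.cluster.allow-weakly-up-members (a duration that delays moving joining members to WeaklyUp; off still disables it):

```scala
import com.typesafe.config.ConfigFactory

// Assumed property name and new default; previously this was an on/off flag.
val config = ConfigFactory.parseString(
  "akka.cluster.allow-weakly-up-members = 7s")

// The value can now be read back as a duration.
val weaklyUpAfter = config.getDuration("akka.cluster.allow-weakly-up-members")
```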
* Since DeathWatchNotification is sent over the control channel it may overtake
other messages that have been sent from the same actor before it stopped.
* It can be confusing that Terminated can't be used as an end-of-conversation marker (see the sketch after these notes).
* In classic Remoting we didn't have this problem because all messages were sent over
the same connection.
* don't send DeathWatchNotification when system is terminating
* when using Cluster we can rely on the other side publishing AddressTerminated when the member has been removed
* there is actually already a race condition that often results in the DeathWatchNotification from the terminating side not being sent
* in DeathWatch.scala the watchedBy entry is removed when receiving AddressTerminated, and that may (sometimes) happen before tellWatchersWeDied
* same for Unwatch
* to avoid sending many Unwatch messages when the watcher's ActorSystem is terminated
* the same race exists for Unwatch as for DeathWatchNotification, if RemoteWatcher publishes AddressTerminated (publishAddressTerminated) before the watcher is terminated
* config for the flush timeout, and possibility to disable
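A hedged sketch (actor and message names are made up) of why Terminated is a shaky end-of-conversation marker with Artery, as mentioned above: the DeathWatchNotification behind Terminated travels on the control channel and may overtake the last ordinary message when the watcher is on a remote node.

```scala
import akka.actor.{ Actor, ActorRef, Terminated }

class Worker extends Actor {
  def receive = {
    case "finish" =>
      sender() ! "done"   // ordinary user message on the regular channel
      context.stop(self)  // produces a DeathWatchNotification on the control channel
  }
}

class Coordinator(worker: ActorRef) extends Actor {
  context.watch(worker)
  worker ! "finish"
  def receive = {
    case "done"        => // the expected final reply
    case Terminated(_) => // with a remote watch this can arrive before "done"
  }
}
```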
* adjust default minimum for down-all-when-unstable
* when down-all-when-unstable=on it will be >= 4 seconds
* in case stable-after is tweaked to a low value such as 5 seconds
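Sketch of the settings involved (illustrative values only); when down-all-when-unstable is on, the duration is derived from stable-after, and the adjusted minimum keeps it from dropping below 4 seconds:

```scala
import com.typesafe.config.ConfigFactory

// Illustrative only: with stable-after tuned down to 5 seconds, the duration
// derived from down-all-when-unstable = on is assumed to be floored at 4 seconds
// by this change instead of following stable-after all the way down.
val sbrTuning = ConfigFactory.parseString("""
  akka.cluster.split-brain-resolver.stable-after = 5s
  akka.cluster.split-brain-resolver.down-all-when-unstable = on
  """)
```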
* Update aeron-client, aeron-driver to 1.30.0
* Upgrade to agrona 1.7.2, to keep in line with aeron
Co-authored-by: Christopher Batey <christopher.batey@gmail.com>
* issue could be reproduced with sleep(200) before the persistenceTestKit.clearByPersistenceId
in EventSourcedBehaviorTestKitImpl
* problem is that there is a race condition between that clear and the EventSourcedBehavior starting concurrently, which can result in the EventSourcedBehavior seeing events from a previous test if the same persistenceId is used
* solution is to clearAll before starting the EventSourcedBehavior
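A rough sketch of the corrected ordering using the public testkit APIs (not the actual EventSourcedBehaviorTestKitImpl code; the behavior to spawn is a placeholder):

```scala
import akka.actor.testkit.typed.scaladsl.ActorTestKit
import akka.persistence.testkit.PersistenceTestKitPlugin
import akka.persistence.testkit.scaladsl.PersistenceTestKit

// Clear *all* persisted events before the EventSourcedBehavior is spawned,
// instead of clearing by persistenceId while the behavior may already be
// starting and replaying events from a previous test.
val testKit = ActorTestKit(PersistenceTestKitPlugin.config)
val persistenceTestKit = PersistenceTestKit(testKit.system)

persistenceTestKit.clearAll() // no race with a concurrently starting behavior
// testKit.spawn(eventSourcedBehavior)  // placeholder: spawn only after the clear
```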
Adds some level of cluster awareness to both the LeastShardAllocationStrategy implementations:
* #27368 prefer shard allocations on new nodes during rolling updates
* #27367 don't rebalance during rolling update
* #29554 don't rebalance when there are joining nodes
* #29553 don't allocate to leaving, downed, exiting and unreachable nodes
* When allocating, nodes that are joining, unreachable or leaving are de-prioritized, to decrease the risk that a shard is allocated only to directly need to be re-allocated to a different node (see the hypothetical sketch below).
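A hypothetical, simplified sketch (not the actual implementation) of the de-prioritization idea: order candidate members so that Up nodes are preferred, joining nodes are used only as a fallback, and leaving/exiting/downed nodes are excluded.

```scala
import akka.cluster.{ Member, MemberStatus }

// Hypothetical helper, for illustration only. Unreachable nodes are also avoided
// by the real strategy, but reachability is not part of MemberStatus and is
// omitted here.
def allocationCandidates(members: Iterable[Member]): Seq[Member] = {
  val usable = members.toSeq.filterNot { m =>
    m.status == MemberStatus.Leaving ||
    m.status == MemberStatus.Exiting ||
    m.status == MemberStatus.Down
  }
  // Up members first; joining / WeaklyUp members only as a last resort.
  val (preferred, dePrioritized) = usable.partition(_.status == MemberStatus.Up)
  preferred ++ dePrioritized
}
```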