* Testing of singleton leaving
* gossip optimization, exiting change to two oldest per role
* hardening ClusterSingletonManagerIsStuck restart, increase ClusterSingletonManagerIsStuck
* Introduce 'MemberDowned' member event
Compatibility note: MemberEvent is a sealed trait, so it is debatable whether
it is acceptable to introduce a new member event type.
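For illustration, a minimal sketch of a subscriber handling the new event (the actor itself is hypothetical; MemberDowned and the subscription API are from akka-cluster):
```scala
import akka.actor.{ Actor, ActorLogging, Props }
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Minimal sketch of a subscriber that handles the new event; existing
// subscribers that pattern match on MemberEvent subtypes non-exhaustively
// need a fallback case, since the sealed trait gained a new subtype.
class DownedListener extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents, classOf[MemberEvent])

  override def postStop(): Unit =
    cluster.unsubscribe(self)

  def receive: Receive = {
    case MemberDowned(member) =>
      log.info("Member was downed: {}", member.address)
    case _: MemberEvent => // other member events not relevant here
  }
}

object DownedListener {
  def props: Props = Props(new DownedListener)
}
```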
* Be more conservative (more like leaving), add test
* Fixes #25489 where a cluster event for a previous state can override
the call to cluster.close, setting it to removed
* Fix case where Removed is used as a placeholder for unknown
* Detect that joining node is 2.5.9 or earlier by empty ConfigCheck
config in InitJoin message. Then send back Address, which was the
old representation of InitJoinAck
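A rough sketch of the idea, using simplified, hypothetical message shapes rather than the actual internal akka.cluster protocol classes:
```scala
import akka.actor.Address

object InitJoinCompat {
  // Hypothetical, simplified message shapes for illustration only; the real
  // messages are internal to akka.cluster and differ in structure.
  final case class InitJoin(configCheck: String) // empty when sent by 2.5.9 or earlier
  final case class InitJoinAck(address: Address, configCheck: String)

  def replyTo(join: InitJoin, selfAddress: Address, ownConfigCheck: String): AnyRef =
    if (join.configCheck.isEmpty)
      // Old joining node: it sent no ConfigCheck config, so answer with the
      // plain Address, the old representation of InitJoinAck that it understands.
      selfAddress
    else
      InitJoinAck(selfAddress, ownConfigCheck)
}
```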
* Include akka.version in logging to facilitate troubleshooting
Default is 5s, which means that if the first Read is lost and
the ddata in a test doesn't have any secondary nodes to query, it
will time out waiting to get the state.
E.g. a read being ignored due to loading durable state then
never gets retried.
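For context, a sketch of where such a read timeout shows up in a Distributed Data Get; the key, system name and timeout value below are made up:
```scala
import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.cluster.ddata.{ DistributedData, ORSetKey }
import akka.cluster.ddata.Replicator.{ Get, ReadMajority }

object ReadTimeoutExample extends App {
  val system = ActorSystem("example")
  val replicator = DistributedData(system).replicator
  val DataKey = ORSetKey[String]("elements")

  // If the single Read request is lost, or ignored because durable state is
  // still loading, the Get only fails (GetFailure) once this timeout expires,
  // so a too-short timeout makes such tests flaky. In real code this is sent
  // from within an actor so the reply has somewhere to go.
  replicator ! Get(DataKey, ReadMajority(timeout = 10.seconds))
}
```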
* fix NPE in shutdownTransport
* perhaps because shutdown was called before the transport was started
* system.dispatcher is used in other places of the shutdown
* improve logging of compression advertisement progress
* adjust RestartFlow.withBackoff parameters
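For reference, the shape of the parameters being tuned; the values below are illustrative, not the ones chosen in this change:
```scala
import scala.concurrent.duration._
import akka.NotUsed
import akka.stream.scaladsl.{ Flow, RestartFlow }

object RestartFlowExample {
  // The wrapped flow is re-created on failure with an exponentially
  // growing, randomized delay between attempts.
  val restartingFlow: Flow[String, String, NotUsed] =
    RestartFlow.withBackoff(minBackoff = 1.second, maxBackoff = 10.seconds, randomFactor = 0.2) { () =>
      Flow[String].map(_.toUpperCase) // stand-in for the real, failure-prone flow
    }
}
```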
* quarantine after ActorSystemTerminating signal
(will cleanup compressions)
* Quarantine idle associations
* liveness checks by sending extra HandshakeReq and update the
lastUsed when reply received
* conservative default value to survive a network partition, in
case no other messages are sent
* Adjust logging and QuarantinedEvent for harmless quarantine
* Harmless if it was via the shutdown signal or cluster leaving
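A sketch of the kind of configuration involved; the setting names and values are assumptions to be checked against the artery reference.conf of the Akka version in use:
```scala
import com.typesafe.config.ConfigFactory

object ArteryIdleTuning {
  // Assumed setting names and illustrative values. Idle outbound associations
  // are stopped after a while, and only quarantined after a delay long enough
  // to survive a network partition during which no other messages are sent.
  val config = ConfigFactory.parseString("""
    akka.remote.artery.advanced {
      stop-idle-outbound-after = 2 minutes
      quarantine-idle-outbound-after = 6 hours
    }
  """)
}
```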
Each build now produces over 40 MB of logs.
A lot of DEBUG logging was left on for test failures that have since
been fixed. Added an issue # for the ones that are still valid, or
left it on where the test verifies the debug logging.
* Notice that the incarnation has changed in SystemMessageDelivery
and then reset the sequence number
* Take the incarnation number into account in the ClearSystemMessageDelivery
message
* Trigger quarantine earlier in ClusterRemoteWatcher if node with
same host:port joined
* Change quarantine-removed-node-after to 5s, shouldn't be necessary
to delay it 30s
* test reproducer
* fix memory leak in SystemMessageDelivery
* initial set of tests for idle outbound associations, credit to mboogerd
* close inbound compression when quarantined, #23967
* make sure compressions for quarantined are removed in case they are lingering around
* also means that compression advertisement will not be done for quarantined associations
* remove tombstone in InboundCompressions
* simplify async callbacks by using invokeWithFeedback
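Not the internal SystemMessageDelivery stage, just a minimal hypothetical stage illustrating what invokeWithFeedback adds over invoke: a Future[Done] that completes once the stage has processed the callback, or fails if the stage has already stopped:
```scala
import akka.stream.{ Attributes, Outlet, SourceShape }
import akka.stream.stage.{ AsyncCallback, GraphStageLogic, GraphStageWithMaterializedValue, OutHandler }

// Hypothetical source stage into which external code can inject elements.
class InjectableSource[T] extends GraphStageWithMaterializedValue[SourceShape[T], AsyncCallback[T]] {
  val out: Outlet[T] = Outlet("InjectableSource.out")
  override val shape: SourceShape[T] = SourceShape(out)

  override def createLogicAndMaterializedValue(attrs: Attributes): (GraphStageLogic, AsyncCallback[T]) = {
    class Logic extends GraphStageLogic(shape) {
      private var buffer = Vector.empty[T]

      // The materialized callback: safe to invoke from outside the stage.
      val callback: AsyncCallback[T] = getAsyncCallback[T] { elem =>
        if (isAvailable(out)) push(out, elem) else buffer :+= elem
      }

      setHandler(out, new OutHandler {
        override def onPull(): Unit =
          if (buffer.nonEmpty) {
            push(out, buffer.head)
            buffer = buffer.tail
          }
      })
    }
    val logic = new Logic
    (logic, logic.callback)
  }
}

// Usage sketch:
//   val (cb, done) = Source.fromGraph(new InjectableSource[Int]).toMat(Sink.ignore)(Keep.both).run()
//   val ack: Future[Done] = cb.invokeWithFeedback(42) // completes once handled, fails if the
//                                                     // stage has already stopped
```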
* compression for old incarnation, #24400
* it was fixed by the other previous changes
* also confirmed by running the SimpleClusterApp with TCP
as described in the ticket
* test with tcp and tls-tcp transport
* handle the stop signals differently for tcp transport because they
are converted to StreamTcpException
* cancel timers on shutdown
* share the top-level FR for all Association instances
* use linked queue for control and large streams, less memory usage
* remove quarantined idle Association completely after a configured delay
* note that shallow Association instances may still be lingering in the
heap because of cached references from RemoteActorRef, which may
be cached by LruBoundedCache (used when resolving actor refs).
Those are small, since the queues have been removed, and the cache
is bounded.
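The removal delay is driven by configuration; the setting name below is an assumption to be verified against the artery reference.conf:
```scala
import com.typesafe.config.ConfigFactory

object QuarantineCleanupTuning {
  // Assumed setting name and illustrative value: after this delay a quarantined,
  // idle Association is removed entirely so its resources can be reclaimed.
  val config = ConfigFactory.parseString(
    "akka.remote.artery.advanced.remove-quarantined-association-after = 1 hour")
}
```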
* Refactoring to separate the Aeron specific things, ArteryAeronUdpTransport
* move Aeron specific classes to akka.remote.artery.aeron package
* move Version to ArterySettings, and describe strategy for envelope header changes
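With the Aeron-specific pieces isolated, the transport is chosen by configuration; a sketch:
```scala
import com.typesafe.config.ConfigFactory

object ArteryTransportSelection {
  // aeron-udp keeps the previous behaviour; tcp and tls-tcp are the
  // Akka Streams based TCP transports.
  val config = ConfigFactory.parseString("""
    akka.remote.artery {
      enabled = on
      transport = tcp  # or aeron-udp / tls-tcp
    }
  """)
}
```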
The test would fail, picking up the reachable event from the previous
unsplit, as it is a new probe.
Also change barrierCounter to split/unsplit so it is easier to see
where the failure is when a barrier fails.
* When leaving/downing the last node in a DC it would not
be removed in another DC, since that was only done by the
leader in the owning DC (and that is gone).
* It should be ok to eagerly remove such nodes also by
leaders in other DCs.
* Note that gossip is already sent out so for the last node
that will be spread to other DC, unless there is a network
partition. For that we can't do anything. It will be replaced
if joining again.
There exists a race where a cluster node that is being downed sees
itself as the oldest node (as it has had the other nodes removed) and it
takes over the singleton manager, causing the real oldest node to go into
the End state, meaning that cluster singletons never work again.
This fix simply prevents Member events from being given to the singleton
manager FSM during a shutdown, instead relying on SelfExiting.
This also hardens the test by not downing the node that the current
sharding coordinator is running on, as well as fixing a bug in the
probes.
The last time this failed there was no gossip to or from a node that
didn't see fifth coming back.
Also note that this test doesn't quite test what it says, as the split
brain is repaired before starting the second actor system, but without
extensions to the multi-jvm test kit this can't be improved.
Refs #23306