Supervision
===========

This chapter outlines the concept behind supervision, the primitives offered
and their semantics. For details on how that translates into real code, please
refer to the corresponding chapters for Scala and Java APIs.

What Supervision Means
----------------------

Supervision describes a dependency relationship between actors: the supervisor
delegates tasks to subordinates and therefore must respond to their failures.
When a subordinate detects a failure (i.e. throws an exception), it suspends
itself and all its subordinates and sends a message to its supervisor,
signaling failure. Depending on the nature of the work to be supervised and
the nature of the failure, the supervisor has four basic choices:

#. Resume the subordinate, keeping its accumulated internal state
#. Restart the subordinate, clearing out its accumulated internal state
#. Terminate the subordinate permanently
#. Escalate the failure

It is important to always view an actor as part of a supervision hierarchy.
This explains the existence of the fourth choice (a supervisor is itself
subordinate to another supervisor higher up) and has implications for the first
three: resuming an actor resumes all its subordinates, restarting an actor
entails restarting all its subordinates, and stopping an actor likewise stops
all its subordinates. Note that the default behavior of an actor is to stop all
its children before restarting, but this can be overridden using the
:meth:`preRestart` hook.

Each supervisor is configured with a function translating all possible failure
causes (i.e. exceptions) into one of the four choices given above; notably,
this function does not take the failed actor’s identity as an input. It is
quite easy to come up with examples of structures where this might not seem
flexible enough, e.g. wishing for different strategies to be applied to
different subordinates. At this point it is vital to understand that
supervision is about forming a recursive fault handling structure. If you try
to do too much at one level, it will become hard to reason about, hence the
recommended way in this case is to add a level of supervision.

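The idea of a function from failure cause to supervisor choice can be
illustrated with a minimal plain-Java sketch. It deliberately does not use
Akka's API: the names ``DeciderSketch``, ``Directive`` and ``DECIDER`` are
hypothetical, and the mapping of particular exception types to choices is an
arbitrary assumption made for illustration only.

.. code-block:: java

   import java.util.function.Function;

   // Hypothetical model of the four supervisor choices; not Akka's actual API.
   class DeciderSketch {
       enum Directive { RESUME, RESTART, STOP, ESCALATE }

       // The function sees only the failure cause, never the failed child's identity.
       // The exception-to-choice mapping below is an arbitrary example.
       static final Function<Throwable, Directive> DECIDER = cause -> {
           if (cause instanceof ArithmeticException) return Directive.RESUME;
           if (cause instanceof IllegalStateException) return Directive.RESTART;
           if (cause instanceof IllegalArgumentException) return Directive.STOP;
           return Directive.ESCALATE; // unknown causes are handed to the next level up
       };

       public static void main(String[] args) {
           System.out.println(DECIDER.apply(new ArithmeticException("division by zero"))); // RESUME
           System.out.println(DECIDER.apply(new RuntimeException("unknown")));             // ESCALATE
       }
   }

Because the function has no access to the child's identity, applying different
strategies to different children means placing them under different
supervisors, which is the extra level of supervision recommended above.
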
Akka implements a specific form called “parental supervision”. Actors can only
be created by other actors (the top-level actor being provided by the library),
and each created actor is supervised by its parent. This restriction makes the
formation of actor supervision hierarchies explicit and encourages sound design
decisions. It also guarantees that actors cannot be orphaned or attached to
supervisors from the outside, which might otherwise catch them unawares. In
addition, this yields a natural and clean shutdown procedure for (parts of)
actor applications.

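As a rough illustration of why parental supervision gives a clean shutdown
procedure, the following plain-Java sketch models a hierarchy in which a node
can only be created through its parent. All names (``HierarchySketch``,
``Node``, ``createChild``, ``root``) are hypothetical and are not part of
Akka's API; the sketch assumes that stopping a node means stopping its subtree
first.

.. code-block:: java

   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical model of parental supervision; names are illustrative only.
   class HierarchySketch {
       static final class Node {
           final String name;
           final Node parent; // null only for the library-provided root
           final List<Node> children = new ArrayList<>();
           boolean stopped = false;

           private Node(String name, Node parent) {
               this.name = name;
               this.parent = parent;
           }

           // The only way to create a node is through its future supervisor.
           Node createChild(String childName) {
               Node child = new Node(childName, this);
               children.add(child);
               return child;
           }

           // Stops the whole subtree bottom-up and records the order in 'log'.
           void stop(List<String> log) {
               for (Node child : children) child.stop(log);
               stopped = true;
               log.add(name);
           }
       }

       // Stands in for the top-level actor provided by the library.
       static Node root() { return new Node("root", null); }

       public static void main(String[] args) {
           Node root = root();
           Node worker = root.createChild("worker");
           worker.createChild("helper");
           List<String> log = new ArrayList<>();
           worker.stop(log);
           System.out.println(log); // [helper, worker]
       }
   }

Since the constructor is private, no node can exist without a parent, so
stopping any node is enough to shut down that part of the application.
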
What Restarting Means
---------------------

When presented with an actor which failed while processing a certain message,
causes for the failure fall into three categories:

* Systematic (i.e. programming) error for the specific message received
* (Transient) failure of some external resource used while processing the message
* Corrupt internal state of the actor

Unless the failure is specifically recognizable, the third cause cannot be
ruled out, which leads to the conclusion that the internal state needs to be
cleared out. If the supervisor decides that neither its other children nor
itself are affected by the corruption (e.g. because of conscious application of
the error kernel pattern), it is therefore best to restart the child. This is
carried out by creating a new instance of the underlying :class:`Actor` class
and replacing the failed instance with the fresh one inside the child’s
:class:`ActorRef`; the ability to do this is one of the reasons for
encapsulating actors within special references. The new actor then resumes
processing its mailbox, meaning that the restart is not visible outside of the
actor itself, with the notable exception that the message during which the
failure occurred is not re-processed.

Restarting an actor in this way recursively terminates all its children. If
this is not the right approach for certain sub-trees of the supervision
hierarchy, you may choose to retain the children, in which case they will be
recursively restarted in the same fashion as the failed parent. The same
default of terminating children applies to each of them and must be overridden
on a per-actor basis; see :class:`Actor` for details.

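The replacement of a failed instance behind a stable reference can be sketched
in plain Java as follows. This is a simplified model under stated assumptions,
not Akka's implementation: ``RestartSketch``, ``Counter`` and ``Ref`` are
hypothetical names, the supervisor's decision is hard-wired to "restart", and
children are not modelled at all.

.. code-block:: java

   import java.util.ArrayDeque;
   import java.util.Queue;
   import java.util.function.Supplier;

   // Hypothetical model of restart semantics; not Akka's actual classes.
   class RestartSketch {
       // Minimal stand-in for an actor: accumulates state and may fail on a message.
       static class Counter {
           int total = 0;

           void receive(int msg) {
               if (msg < 0) throw new IllegalArgumentException("negative: " + msg);
               total += msg;
           }
       }

       // Stand-in for the special reference that hides the concrete instance.
       static class Ref {
           private final Supplier<Counter> factory;
           private final Queue<Integer> mailbox = new ArrayDeque<>();
           private Counter instance;
           int restarts = 0;

           Ref(Supplier<Counter> factory) {
               this.factory = factory;
               this.instance = factory.get();
           }

           void tell(int msg) { mailbox.add(msg); }

           // Drains the mailbox; a failure swaps in a fresh instance and moves on.
           void processAll() {
               Integer msg;
               while ((msg = mailbox.poll()) != null) {
                   try {
                       instance.receive(msg);
                   } catch (RuntimeException failure) {
                       instance = factory.get(); // state cleared; failed message is not retried
                       restarts++;
                   }
               }
           }

           int total() { return instance.total; }
       }

       public static void main(String[] args) {
           Ref ref = new Ref(Counter::new);
           ref.tell(5);
           ref.tell(-1); // fails and triggers a restart
           ref.tell(7);
           ref.processAll();
           // prints "7 after 1 restart(s)": the 5 was lost with the old state
           System.out.println(ref.total() + " after " + ref.restarts + " restart(s)");
       }
   }

Senders keep using the same ``Ref`` throughout; only the accumulated state and
the single failing message are lost.
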
What Lifecycle Monitoring Means
-------------------------------

In contrast to the special relationship between parent and child described
above, each actor may monitor any other actor. Since actors emerge from
creation fully alive and restarts are not visible outside of the affected
supervisors, the only state change available for monitoring is the transition
from alive to dead. Monitoring is thus used to tie one actor to another so that
it may react to the other actor’s termination, in contrast to supervision which
reacts to failure.

Lifecycle monitoring is implemented using a :class:`Terminated` message to be
received by the behavior of the monitoring actor, where the default behavior is
to throw a special :class:`DeathPactException` if not otherwise handled. One
important property is that the message will be delivered irrespective of the
order in which the monitoring request and target’s termination occur, i.e. you
still get the message even if at the time of registration the target is already
dead.

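The order-independence of the notification can be illustrated with a small
plain-Java sketch. It is a model only: ``MonitoringSketch``, ``Target`` and
``watch`` are hypothetical names, the notification is represented by a plain
string rather than a real :class:`Terminated` message, and delivery is
synchronous, which is an assumption made to keep the example short.

.. code-block:: java

   import java.util.ArrayList;
   import java.util.List;
   import java.util.function.Consumer;

   // Hypothetical model of lifecycle monitoring; not Akka's actual API.
   class MonitoringSketch {
       static class Target {
           final String name;
           private boolean alive = true;
           private final List<Consumer<String>> watchers = new ArrayList<>();

           Target(String name) { this.name = name; }

           // Registers a watcher; if the target is already dead, notify right away.
           void watch(Consumer<String> watcher) {
               if (alive) watchers.add(watcher);
               else watcher.accept("Terminated(" + name + ")");
           }

           // The only observable transition: alive -> dead.
           void stop() {
               if (!alive) return;
               alive = false;
               for (Consumer<String> w : watchers) w.accept("Terminated(" + name + ")");
               watchers.clear();
           }
       }

       public static void main(String[] args) {
           List<String> inbox = new ArrayList<>();
           Target early = new Target("a");
           early.watch(inbox::add); // registered before termination
           early.stop();
           Target late = new Target("b");
           late.stop();
           late.watch(inbox::add);  // registered after termination
           System.out.println(inbox); // [Terminated(a), Terminated(b)]
       }
   }
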
Monitoring is particularly useful if a supervisor cannot simply restart its
children and has to stop them, e.g. in case of errors during actor
initialization. In that case it should monitor those children and re-create
them or schedule itself to retry this at a later time.

Another common use case is that an actor needs to fail in the absence of an
external resource, which may also be one of its own children. If a third party
terminates a child by way of the ``stop()`` method or by sending a
:class:`PoisonPill`, the supervisor might well be affected.