pekko/akka-docs-dev/rst/scala/stream-flows-and-basics.rst
2014-12-20 14:11:29 +01:00

167 lines
11 KiB
ReStructuredText
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

.. _stream-flow-scala:
#############################
Basics and working with Flows
#############################
Core concepts
=============
Everything in Akka Streams revolves around a number of core concepts which we introduce in detail in this section.
Akka Streams provide a way for executing bounded processing pipelines, where bounds are expressed as the number of stream
elements in flight and in buffers at any given time. Please note that while this allows to estimate an limit memory use
it is not strictly bound to the size in memory of these elements.
First we define the terminology which will be used though out the entire documentation:
Stream
An active process that involves moving and transforming data.
Element
An element is the unit which is passed through the stream. All operations as well as back-pressure are expressed in
terms of elements.
Back-pressure
A means of flow-control, and most notably adjusting the speed of upstream sources to the consumption speeds of their sinks.
In the context of Akka Streams back-pressure is always understood as *non-blocking* and *asynchronous*
Processing Stage
The common name for all building blocks that build up a Flow or FlowGraph.
Examples of a processing stage would be Stage (:class:`PushStage`, :class:`PushPullStage`, :class:`StatefulStage`,
:class:`DetachedStage`), in terms of which operations like ``map()``, ``filter()`` and others are implemented.
Sources, Flows and Sinks
------------------------
Linear processing pipelines can be expressed in Akka Streams using the following three core abstractions:
Source
A processing stage with *exactly one output*, emitting data elements in response to it's down-stream demand.
Sink
A processing stage with *exactly one input*, generating demand based on it's internal demand management strategy.
Flow
A processing stage which has *exactly one input and output*, which connects it's up and downstreams by (usually)
transforming the data elements flowing through it.
RunnableFlow
A Flow with has both ends "attached" to a Source and Sink respectively, and is ready to be ``run()``.
It is important to remember that while constructing these processing pipelines by connecting their different processing
stages no data will flow through it until it is materialized. Materialization is the process of allocating all resources
needed to run the computation described by a Flow (in Akka Streams this will often involve starting up Actors).
Thanks to Flows being simply a description of the processing pipeline they are *immutable, thread-safe, and freely shareable*,
which means that it is for example safe to share send between actorsto have one actor prepare the work, and then have it
be materialized at some completely different place in the code.
In order to be able to run a ``Flow[In,Out]`` it must be connected to a ``Sink[In]`` *and* ``Source[Out]`` of matching types.
It is also possible to directly connect a :class:`Sink` to a :class:`Source`.
.. includecode:: code/docs/stream/FlowDocSpec.scala#materialization-in-steps
The :class:`MaterializedMap` can be used to get materialized values of both sinks and sources out from the running
stream. In general, a stream can expose multiple materialized values, however the very common case of only wanting to
get back a Sinks (in order to read a result) or Sources (in order to cancel or influence it in some way) materialized
values has a small convenience method called ``runWith()``. It is available for ``Sink`` or ``Source`` and ``Flow``, with respectively,
requiring the user to supply a ``Source`` (in order to run a ``Sink``), a ``Sink`` (in order to run a ``Source``) and
both a ``Source`` and a ``Sink`` (in order to run a ``Flow``, since it has neither attached yet).
.. includecode:: code/docs/stream/FlowDocSpec.scala#materialization-runWith
It is worth pointing out that since processing stages are *immutable*, connecting them returns a new processing stage,
instead of modifying the existing instance, so while construction long flows, remember to assign the new value to a variable or run it:
.. includecode:: code/docs/stream/FlowDocSpec.scala#source-immutable
.. note::
By default Akka Streams elements support **exactly one** downstream processing stage.
Making fan-out (supporting multiple downstream processing stages) an explicit opt-in feature allows default stream elements to
be less complex and more efficient. Also it allows for greater flexibility on *how exactly* to handle the multicast scenarios,
by providing named fan-out elements such as broadcast (signals all down-stream elements) or balance (signals one of available down-stream elements).
In the above example we used the ``runWith`` method, which both materializes the stream and returns the materialized value
of the given sink or source.
.. _back-pressure-explained-scala:
Back-pressure explained
-----------------------
Akka Streams implements an asynchronous non-blocking back-pressure protocol standardised by the Reactive Streams
specification, which Akka is a founding member of.
As library user you do not have to write any explicit back-pressure handling code in order for it to work - it is built
and dealt with automatically by all of the provided Akka Streams processing stages. However is possible to include
explicit buffers with overflow strategies that can influence the behaviour of the stream. This is especially important
in complex processing graphs which may even sometimes even contain loops (which *must* be treated with very special
care, as explained in :ref:`cycles-scala`).
The back pressure protocol is defined in terms of the number of elements a downstream ``Subscriber`` is able to receive,
referred to as ``demand``. This demand is the *number of elements* receiver of the data, referred to as ``Subscriber``
in Reactive Streams, and implemented by ``Sink`` in Akka Streams is able to safely consume at this point in time.
The source of data referred to as ``Publisher`` in Reactive Streams terminology and implemented as ``Source`` in Akka
Streams guarantees that it will never emit more elements than the received total demand for any given ``Subscriber``.
.. note::
The Reactive Streams specification defines its protocol in terms of **Publishers** and **Subscribers**.
These types are *not* meant to be user facing API, instead they serve as the low level building blocks for
different Reactive Streams implementations.
Akka Streams implements these concepts as **Sources**, **Flows** (referred to as **Processor** in Reactive Streams)
and **Sinks** without exposing the Reactive Streams interfaces directly.
If you need to inter-op between different read :ref:`integration-with-Reactive-Streams-enabled-libraries`.
The mode in which Reactive Streams back-pressure works can be colloquially described as "dynamic push / pull mode",
since it will switch between push or pull based back-pressure models depending on if the downstream is able to cope
with the upstreams production rate or not.
To illustrate further let us consider both problem situations and how the back-pressure protocol handles them:
Slow Publisher, fast Subscriber
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is the happy case of coursewe do not need to slow down the Publisher in this case. However signalling rates are
rarely constant and could change at any point in time, suddenly ending up in a situation where the Subscriber is now
slower than the Publisher. In order to safeguard from these situations, the back-pressure protocol must still be enabled
during such situations, however we do not want to pay a high penalty for this safety net being enabled.
The Reactive Streams protocol solves this by asynchronously signalling from the Subscriber to the Publisher
`Request(n:Int)` signals. The protocol guarantees that the Publisher will never signal *more* than the demand it was
signalled. Since the Subscriber however is currently faster, it will be signalling these Request messages at a higher
rate (and possibly also batching together the demand - requesting multiple elements in one Request signal). This means
that the Publisher should not ever have to wait (be back-pressured) with publishing its incoming elements.
As we can see, in this scenario we effectively operate in so called push-mode since the Publisher can continue producing
elements as fast as it can, since the pending demand will be recovered just-in-time while it is emitting elements.
Fast Publisher, slow Subscriber
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is the case when back-pressuring the ``Publisher`` is required, because the ``Subscriber`` is not able to cope with
the rate at which its upstream would like to emit data elements.
Since the ``Publisher`` is not allowed to signal more elements than the pending demand signalled by the ``Subscriber``,
it will have to abide to this back-pressure by applying one of the below strategies:
- not generate elements, if it is able to control their production rate,
- try buffering the elements in a *bounded* manner until more demand is signalled,
- drop elements until more demand is signalled,
- tear down the stream if unable to apply any of the above strategies.
As we can see, this scenario effectively means that the ``Subscriber`` will *pull* the elements from the Publisher
this mode of operation is referred to as pull-based back-pressure.
.. _stream-materialization-scala:
Stream Materialization
----------------------
**TODO - write me (feel free to move around as well)**
When constructing flows and graphs in Akka Streams think of them as preparing a blueprint, an execution plan.
Stream materialization is the process of taking a stream description (the graph) and allocating all the necessary resources
it needs in order to run. In the case of Akka Streams this often means starting up Actors which power the processing,
but is not restricted to that - it could also mean opening files or socket connections etc. depending on what the stream needs.
Materialization is triggered at so called "terminal operations". Most notably this includes the various forms of the ``run()``
and ``runWith()`` methods defined on flow elements as well as a small number of special syntactic sugars for running with
well-known sinks, such as ``foreach(el => )`` (being an alias to ``runWith(Sink.foreach(el => ))``.
Materialization is currently performed synchronously on the materializing thread.
Tha actual stream processing is handled by :ref:`Actors actor-scala` started up during the streams materialization,
which will be running on the thread pools they have been configured to run on - which defaults to the dispatcher set in
:class:`MaterializationSettings` while constructing the :class:`FlowMaterializer`.
.. note::
Reusing *instances* of linear computation stages (Source, Sink, Flow) inside FlowGraphs is legal,
yet will materialize that stage multiple times.