Infrastructure Architecture
Fig. 15 Dragon Runtime Architecture in a multi-node deployment
The various actors involved in the Dragon runtime are shown in Fig. 15. Although they are specified in more detail in other documents, we list them here and summarize their functions. Where a concrete example is needed, the user program is called my.py and is started from the command line by invoking the Launcher with dragon my.py. In particular, no preliminary interaction with the system's workload manager (if present) is required.
Launcher
The Launcher service is the dragon executable that brings up the other pieces of the runtime from nothing and arranges for the user program (e.g. my.py) to start executing. It consists of a frontend and a backend component. The backend provides the Launcher's functionality on the compute nodes: it routes traffic from the node-local Local Services process to the frontend and connects traffic from the launcher frontend to Channels. In the single-node case, the launcher frontend and backend talk directly to each other; in the multi-node case, they communicate through the TCP Transport. See Multi Node Deployment and Single Node Deployment for more details on the deployment process.
Local Services
The Local Services process is the direct parent of all managed processes as well as some of the infrastructure processes.
Local Services’ responsibilities include:
creating named shared memory pools (a default and an infrastructure pool) for interprocess communication during bringup
instantiating infrastructure channels in those pools
starting managed and unmanaged processes on its node
aggregating, filtering, and forwarding stdout and stderr from the processes it has started ultimately back to the launcher
propagating messages to a process’s stdin from a channel
In the single-node case, there is exactly one Local Services process. In the multi-node case, there is exactly one Local Services process per node. See ProcessCreationAndInteraction for more details on process creation.
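The stdout/stderr aggregation responsibility can be illustrated with a minimal sketch using the standard library's subprocess module. This is not Dragon's implementation (in Dragon, output is forwarded over infrastructure channels to the launcher); the tagging scheme and function name here are illustrative.

```python
import subprocess
import sys

# Illustrative sketch: start a process, capture its stdout/stderr, tag each
# line with the process identity, and forward the result (here, to our own
# stdout; in Dragon, back to the launcher over an infrastructure channel).
def start_and_forward(tag, argv):
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []
    for line in proc.stdout:
        lines.append(f"[{tag}] {line.rstrip()}")
    proc.wait()
    return lines

forwarded = start_and_forward(
    "p_uid=4", [sys.executable, "-c", "print('hello from managed process')"]
)
print("\n".join(forwarded))
```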
Global Services
Global Services maintains a global namespace and tracks the state of global objects in a Dragon program, which include managed processes and channels. At the Python level this is done through the API FIXME: add link, but it is fundamentally a message-based service that non-Python programs can interact with as well. Global Services will ultimately be distributed over a hierarchy of service processes, each responsible for a subset of the channels and user processes, but it is discussed here as though it were a single process. In the multi-node case there is no inherent relationship between the number of processes providing Global Services and the number of nodes; how many are required depends on the overall scale and nature of the user's application.
Transport Agent
The Transport Agent is a process that is present, one per node, on all multi-node Dragon runtimes. It attaches to the shared memory segment created by Local Services and routes messages destined for other nodes using a lower-level communication mechanism, such as TCP or libfabric. There is no transport agent in single-node deployments.
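The routing idea can be sketched as follows, assuming the TCP transport. This is not Dragon's wire protocol; the message format and the `route` helper are invented for illustration: a message carries a destination node id, and messages for another node are forwarded over a socket to that node's agent.

```python
import json
import socket
import threading

# Illustrative only: a "remote agent" accepts one forwarded message and
# delivers it into its local list (standing in for a local channel).
def remote_agent(server_sock, delivered):
    conn, _ = server_sock.accept()
    with conn, conn.makefile() as f:
        delivered.append(json.loads(f.readline()))

server = socket.create_server(("127.0.0.1", 0))
delivered = []
t = threading.Thread(target=remote_agent, args=(server, delivered))
t.start()

def route(msg, my_node, remote_addr):
    if msg["dest_node"] == my_node:
        return msg                                   # deliver locally
    with socket.create_connection(remote_addr) as s: # forward off-node
        s.sendall((json.dumps(msg) + "\n").encode())

route({"dest_node": 1, "payload": "hi"}, my_node=0,
      remote_addr=server.getsockname())
t.join()
server.close()
```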
Communication Pathways
FIXME: This could use some more refinement.
There are various CommunicationComponents that need to be set up to get the Dragon runtime going.
Channels are the main mechanism to unify on-node and off-node communication of Dragon processes in the runtime. Dragon services communicate with each other using the Messages API through special infrastructure Channels. There is always at least one infrastructure channel per service present and except for bringup and teardown of the runtime, all communication between services runs through channels.
During Single Node Bringup or Multi Node Bringup, Local Services allocates a segment of _POSIXSharedMemory to hold ManagedMemory. It then allocates a dedicated infrastructure managed memory pool and creates all infrastructure Channels in it. Every channel is represented by a serialized descriptor that contains enough information about the channel, the channel's managed memory allocation, and the managed memory pool for any process to attach to and use the channel.
This effectively implements on-node shared memory for Dragon managed and unmanaged processes.
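The descriptor/attach pattern can be illustrated with the standard library's POSIX shared memory wrapper. Dragon's ManagedMemory API and serialized descriptors are richer than this, but the idea is the same: a small, shareable token is enough for any process to attach to the same underlying memory.

```python
from multiprocessing import shared_memory

# One process creates a pool and writes into it.
pool = shared_memory.SharedMemory(create=True, size=1024)
pool.buf[:5] = b"hello"

# The pool's name stands in for Dragon's serialized descriptor: any process
# that receives it can attach to the same memory.
descriptor = pool.name
view = shared_memory.SharedMemory(name=descriptor)
data = bytes(view.buf[:5])   # sees the bytes written through the first mapping

view.close()
pool.close()
pool.unlink()
```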
MRNet, a tree-based software overlay network, is an open source project from the University of Wisconsin-Madison. The Launcher uses its broadcast and reduce features during Multi Node Bringup and Multi Node Teardown.
The stdin, stdout and stderr pipes of managed processes are captured by Local Services. InfrastructureBootstrapping may in some cases involve information passed through a process's stdin and stdout; this removes some restrictions on the size of command lines and gives a conventional way to handshake startup processing.
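A stdin-based bootstrap handshake might look like the following sketch. The message contents and field names here are invented, not Dragon's actual protocol: startup parameters are written to the child's stdin instead of the command line, and the child acknowledges on stdout.

```python
import json
import subprocess
import sys

# Illustrative child: read bootstrap parameters from stdin, acknowledge on
# stdout. In a real runtime the child would attach to channels first.
child_code = """
import json, sys
params = json.loads(sys.stdin.readline())
print(json.dumps({"ack": True, "got": sorted(params)}))
"""

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
# Hypothetical bootstrap payload; the descriptor value is a placeholder.
out, _ = proc.communicate(
    json.dumps({"channel_descriptor": "...", "p_uid": 7}) + "\n"
)
reply = json.loads(out)
```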
Conventional IDs
The Dragon infrastructure uses p_uid, c_uid, and m_uid to uniquely identify processes, channels, and memory pools in the runtime system. See ConventionalIDs for more details.
Dragon Process Creation and Interaction
Dragon infrastructure services are so-called unmanaged processes - runtime support processes that are not managed by the Global Services process. The category of managed processes covers those that are created as a result of the code the user runs. This could be because the user creates a process explicitly (such as instantiating a multiprocessing.Process), implicitly (such as instantiating a multiprocessing.Pool), or as a result of creating a managed data structure for higher-level communication. Managed processes are always started by Local Services (see ProcessCreationAndInteraction) and are handed a set of LaunchParameters as environment variables to define the Dragon environment.
Low-level Components
All Dragon processes (managed and unmanaged) are live POSIX processes that use Dragon's ManagedMemory API to share thread-safe memory allocations. When managed memory is allocated from a memory pool, the runtime creates an opaque memory handle (descriptor) and hands it to the calling process. The descriptor can then be shared with any other Dragon process, which uses it to attach to the memory pool and use the underlying object (e.g. a channel). The runtime takes care of proper address translation between processes by storing only the offset from the shared memory base pointer. Thread-safety of the underlying memory object is ensured by using Dragon's Locks.
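Why store offsets rather than pointers? The same pool may be mapped at a different base address in each process, so a raw pointer from one mapping is meaningless in another, while an offset from the pool's base is valid in all of them. A minimal illustration using stdlib shared memory (the (offset, value) scheme is invented for this sketch):

```python
import struct
from multiprocessing import shared_memory

# Create a pool and record an allocation as an offset from the pool base,
# not as an address.
pool = shared_memory.SharedMemory(create=True, size=256)
offset = 64
struct.pack_into("d", pool.buf, offset, 3.14)

# A second, independent mapping of the same pool (possibly at a different
# base address in another process) resolves the same offset correctly.
other = shared_memory.SharedMemory(name=pool.name)
(value,) = struct.unpack_from("d", other.buf, offset)

other.close()
pool.close()
pool.unlink()
```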
The Dragon infrastructure uses the following Components:
Locks: High performance locks to protect ManagedMemory pools.
ManagedMemory: Thread-safe memory pools holding their own state so they can be shared among processes using opaque handles.
UnorderedMap: A hash table implementation in ManagedMemory.
Broadcast: An any-to-many broadcaster that flexibly triggers events for a collection of processes.