Infrastructure Architecture
Fig. 15 Dragon Runtime Architecture in a multi-node deployment
The various actors involved in the Dragon runtime are shown in Fig. 15. Although they are specified in more detail in other documents, we list them here and summarize their functions. Where a concrete example is needed, the user program is called my.py and is started from the command line by invoking the Launcher with dragon my.py. In particular, no preliminary interaction with the system's workload manager (if present) is required.
Launcher
The Launcher service is the dragon executable that brings up the other pieces of the runtime from nothing and arranges for the user program (e.g. my.py) to start executing. It consists of a frontend and a backend component. The backend provides the Launcher's functionality on the compute nodes: it routes traffic from the node-local Local Services process to the frontend and connects traffic from the launcher frontend to Channels. In the single-node case, the launcher frontend and backend talk directly to each other; in the multi-node case, they communicate through the TCP Transport. See Multi Node Deployment and Single Node Deployment for more details on the deployment process.
Local Services
The Local Services process is the direct parent of all managed processes as well as some of the infrastructure processes.
Local Services’ responsibilities include:
creating named shared memory pools (a default and an infrastructure pool) for interprocess communication during bringup
instantiating infrastructure channels in those pools
starting managed and unmanaged processes on its node
aggregating, filtering, and forwarding stdout and stderr from the processes it has started ultimately back to the launcher
propagating messages to a process’s stdin from a channel
In the single-node case, there is exactly one Local Services process. In the multi-node case, there is exactly one Local Services process per node. See ProcessCreationAndInteraction for more details on process creation.
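The stdout/stderr aggregation responsibility can be illustrated with a minimal sketch using the standard library's subprocess module. This is not Dragon's implementation (in Dragon, output is forwarded over infrastructure channels to the launcher); the tagging scheme and function name here are illustrative.

```python
import subprocess
import sys

# Illustrative sketch: start a process, capture its stdout/stderr, tag each
# line with the process identity, and forward the result (here, to our own
# stdout; in Dragon, back to the launcher over an infrastructure channel).
def start_and_forward(tag, argv):
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []
    for line in proc.stdout:
        lines.append(f"[{tag}] {line.rstrip()}")
    proc.wait()
    return lines

forwarded = start_and_forward(
    "p_uid=4", [sys.executable, "-c", "print('hello from managed process')"]
)
print("\n".join(forwarded))
```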
Global Services
Global Services maintains a global namespace and tracks the state of global objects in a Dragon program, which include managed processes and channels. At the Python level this is done through the API FIXME: add link, but it is fundamentally a message-based service that non-Python programs can interact with as well. Global Services will ultimately be distributed over a hierarchy of service processes, each responsible for a subset of the channels and user processes, but it is discussed here as though it were a single process. In the multi-node case there is no inherent relationship between the number of processes providing Global Services and the number of nodes; how many are required depends on the overall scale and nature of the user's application.
Transport Agent
The Transport Agent is a process that is present, one per node, on all multi-node Dragon runtimes. It attaches to the shared memory segment created by Local Services and routes messages destined for other nodes using a lower-level communication mechanism, such as TCP or libfabric. There is no transport agent in single-node deployments.
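The routing idea can be sketched as follows, assuming the TCP transport. This is not Dragon's wire protocol; the message format and the `route` helper are invented for illustration: a message carries a destination node id, and messages for another node are forwarded over a socket to that node's agent.

```python
import json
import socket
import threading

# Illustrative only: a "remote agent" accepts one forwarded message and
# delivers it into its local list (standing in for a local channel).
def remote_agent(server_sock, delivered):
    conn, _ = server_sock.accept()
    with conn, conn.makefile() as f:
        delivered.append(json.loads(f.readline()))

server = socket.create_server(("127.0.0.1", 0))
delivered = []
t = threading.Thread(target=remote_agent, args=(server, delivered))
t.start()

def route(msg, my_node, remote_addr):
    if msg["dest_node"] == my_node:
        return msg                                   # deliver locally
    with socket.create_connection(remote_addr) as s: # forward off-node
        s.sendall((json.dumps(msg) + "\n").encode())

route({"dest_node": 1, "payload": "hi"}, my_node=0,
      remote_addr=server.getsockname())
t.join()
server.close()
```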
Communication Pathways
FIXME: This could use some more refinement.
There are various CommunicationComponents that need to be set up to get the Dragon runtime going.
Channels are the main mechanism to unify on-node and off-node communication of Dragon processes in the runtime. Dragon services communicate with each other using the Messages API through special infrastructure Channels. There is always at least one infrastructure channel per service present and except for bringup and teardown of the runtime, all communication between services runs through channels.
During Single Node Bringup or Multi Node Bringup, Local Services allocates a segment of _POSIXSharedMemory to hold ManagedMemory. It then allocates a dedicated infrastructure managed memory pool and creates all infrastructure Channels in it. Every channel is represented by a serialized descriptor that contains enough information about the channel, the channel's managed memory allocation, and the managed memory pool for any process to attach to and use the channel.
This effectively implements on-node shared memory for Dragon managed and unmanaged processes.
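The descriptor/attach pattern can be illustrated with the standard library's POSIX shared memory wrapper. Dragon's ManagedMemory API and serialized descriptors are richer than this, but the idea is the same: a small, shareable token is enough for any process to attach to the same underlying memory.

```python
from multiprocessing import shared_memory

# One process creates a pool and writes into it.
pool = shared_memory.SharedMemory(create=True, size=1024)
pool.buf[:5] = b"hello"

# The pool's name stands in for Dragon's serialized descriptor: any process
# that receives it can attach to the same memory.
descriptor = pool.name
view = shared_memory.SharedMemory(name=descriptor)
data = bytes(view.buf[:5])   # sees the bytes written through the first mapping

view.close()
pool.close()
pool.unlink()
```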
MRNet, a tree-based software overlay network, is an open source project from the University of Wisconsin-Madison. The Launcher uses its broadcast and reduce features during Multi Node Bringup and Multi Node Teardown.
The stdin, stdout and stderr pipes of managed processes are captured by Local Services. InfrastructureBootstrapping may in some cases involve information passed through a process's stdin and stdout; this removes some restrictions on the size of command lines and gives a conventional way to handshake startup processing.
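A stdin-based bootstrap handshake might look like the following sketch. The message contents and field names here are invented, not Dragon's actual protocol: startup parameters are written to the child's stdin instead of the command line, and the child acknowledges on stdout.

```python
import json
import subprocess
import sys

# Illustrative child: read bootstrap parameters from stdin, acknowledge on
# stdout. In a real runtime the child would attach to channels first.
child_code = """
import json, sys
params = json.loads(sys.stdin.readline())
print(json.dumps({"ack": True, "got": sorted(params)}))
"""

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
# Hypothetical bootstrap payload; the descriptor value is a placeholder.
out, _ = proc.communicate(
    json.dumps({"channel_descriptor": "...", "p_uid": 7}) + "\n"
)
reply = json.loads(out)
```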
Conventional IDs
The Dragon infrastructure uses p_uid, c_uid, and m_uid to uniquely identify processes, channels, and memory pools in the runtime system. See ConventionalIDs for more details.
Dragon Process Creation and Interaction
Dragon infrastructure services are so-called unmanaged processes - runtime support processes that are not managed by the Global Services process. The category of managed processes covers those that are created as a result of the code the user runs. This could be because the user creates a process explicitly (such as instantiating a multiprocessing.Process), implicitly (such as instantiating a multiprocessing.Pool), or as a result of creating a managed data structure for higher-level communication. Managed processes are always started by Local Services (see ProcessCreationAndInteraction) and are handed a set of LaunchParameters as environment variables to define the Dragon environment.
Low-level Components
All Dragon processes (managed and unmanaged) are live POSIX processes that use Dragon's ManagedMemory API to share thread-safe memory allocations. When managed memory is allocated from a memory pool, the runtime creates an opaque memory handle (descriptor) and hands it to the calling process. The descriptor can then be shared with any other Dragon process, which uses it to attach to the memory pool and use the underlying object (e.g. a channel). The runtime takes care of proper address translation between processes by storing only the offset from the shared memory base pointer. Thread-safety of the underlying memory object is ensured by using Dragon's Locks.
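Why store offsets rather than pointers? The same pool may be mapped at a different base address in each process, so a raw pointer from one mapping is meaningless in another, while an offset from the pool's base is valid in all of them. A minimal illustration using stdlib shared memory (the (offset, value) scheme is invented for this sketch):

```python
import struct
from multiprocessing import shared_memory

# Create a pool and record an allocation as an offset from the pool base,
# not as an address.
pool = shared_memory.SharedMemory(create=True, size=256)
offset = 64
struct.pack_into("d", pool.buf, offset, 3.14)

# A second, independent mapping of the same pool (possibly at a different
# base address in another process) resolves the same offset correctly.
other = shared_memory.SharedMemory(name=pool.name)
(value,) = struct.unpack_from("d", other.buf, offset)

other.close()
pool.close()
pool.unlink()
```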
The Dragon infrastructure uses the following Components:
Locks: High performance locks to protect ManagedMemory pools.
ManagedMemory: Thread-safe memory pools holding their own state so they can be shared among processes using opaque handles.
UnorderedMap: A hash table implementation in ManagedMemory.
Broadcast: An any-to-many broadcaster that flexibly triggers events for a collection of processes.