Infrastructure Architecture

FIXME: This should not be about Python Multiprocessing

Dragon applications whether deployed on single node or multi node, require runtime communication and global namespace services that are managed by separate processes from user code. Fundamentally, the purpose of these services is to manage, in user space, operations that the legacy implementation of Python multiprocessing relies on the operating system to do, usually in the form of file descriptors, which is not scalable and does not work in a distributed system. While very portable across time and operating system implementations, a file descriptor based approach doesn’t offer the best performance and scalability.

Fig. 40 Dragon Runtime Architecture in a multi-node deployment

There are various actors involved in the Dragon runtime, they are shown in Fig. 40. Although they will be specified in more detail in other documents, we list them here and summarize their function. Whenever necessary, the user program is called my.py and is started from the command line by invoking the Launcher with dragon my.py. In particular, no preliminary interaction with the system’s workload manager (if present) is required.

Dragon API

User Programs always interact with Dragon using the DragonNative component and its API. The mpbridge implementation of multiprocessing encapsulates the Dragon Native API and provides higher level abstractions built on top of the Dragon Native API calls. The user may choose to interact solely with multiprocessing or may choose to use multiprocessing and the Dragon Native API in concert with one another.

The Dragon Run-Time Services consist of the following top-level components:

Launcher

The Launcher service is the dragon executable that brings up the other pieces of the runtime from nothing and arranges for the user program (e.g. my.py) to start executing. It consists of a frontend and backend component. The backend communicates with the frontend to provide all the functionality of the Launcher on compute-nodes. In particular, it routes traffic the node local Local Services process may have to the frontend and connects traffic that the launcher frontend may have to Channels. In the single-node case, the launcher frontend and backend talk directly to each other. In the multi-node case, they use MRNet to communicate. See Multi Node Deployment and Single Node Deployment for more details on the deployment process.

Local Services

The Local Services process is the direct parent of all managed processes as well as some of the infrastructure processes.

Local Services’ responsibilities include:

creating a named shared memory pools (default and infrastructure pool) for interprocess communication during bringup

instantiating infrastructure channels in that segment.

starting managed and unmanaged processes on its node

aggregating, filtering, and forwarding stdout and stderr from the processes it has started ultimately back to the launcher

propagating messages to a process’s stdin from a channel

In the single node case, there is exactly one shepherd. In a multi node case, there is exactly one shepherd process per node. See Process Creation and Interaction for more details on process creation.

Global Services

Global Services maintains a global namespace and tracks the state of global objects in a Dragon program, which include managed processes and channels. This is done on the Python level using the API FIXME: add link – but fundamentally this is a message based service and can be interacted with by non Python programs. Global Services will ultimately be distributed over a hierarchy of service processes each with responsibility for some of the channels and user processes, but here is discussed as though it is a single process. In multi node cases there is no inherent relationship between the number of processes providing Global Services and nodes - the number that are required will depend to some degree on the overall scale and nature of the user’s application.

Transport Agent

The TransportAgent is a process that is present, one per node, on all multi-node Dragon runtimes. It attaches to the shared memory segment created by the Local Services and routes messages destined to other nodes to the counterpart transport agent on that node using a lower level communication mechanism such as MPI or libfabric. There are many reasons (FIXME: Which ones ? Link to introduction for design decisions ?) why this is necessary instead of accessing this lower level mechanism directly from user python processes - particularly that these libraries don’t easily support dynamic process creation and deletion. It also has the responsibility of exposing any special system wide synchronization constructs (such as a barrier) to the runtime, typically a special type of channel. There is no transport agent in single node deployments.

Communication Pathways

FIXME: This could use some more refinement.

There are various CommunicationComponents that need to be setup to get the Dragon runtime going.

Channels are the main mechanism to unify on-node and off-node communication of Dragon processes in the runtime. Dragon services communicate with each other using the Infrastructure Messages API through special infrastructure Channels. There is always at least one infrastructure channel per service present and except for bringup and teardown of the runtime, all communication between services runs through channels.

During Single Node Bringup or Multi Node Bringup, the Local Services allocates a segment of _POSIXSharedMemory to hold Managed Memory. It then allocates a dedicated infrastructure managed memory pool and creates all infrastructure Channels into it. Every channel is then represented by a serialised descriptor that contains enough information about the channel, the managed memory allocation for the channel, and the managed memory pool. Every process can use the serialized descriptor to attach to and use the channel.

This effectively implements shared on-node shared memory for Dragon managed and un-managed processes.

MRNet, a tree-based software overlay network, is an open source project out of the University of Madison, WI. The Launcher uses its broadcast and reduce features service during Multi Node Bringup and Multi Node Teardown.

The stdin, stdout and stderr pipes of managed processes are captured by the Local Services. Some Infrastructure Bootstrapping may in some cases involve information passed through the process’s stdin and stdout - this can remove some restrictions on the size of command lines and give a conventional way to handshake startup processing.

Conventional IDs

The Dragon infrastructure uses Process ID (p_uid), Channel ID (c_uid), and Memory Pool ID (m_uid) to uniquely identify processes, channels, and memory pools in the runtime system. See Conventional IDs for more details.

Dragon Process Creation and Interaction

Dragon infrastructure Services are so-called unmanaged processes - namely, runtime support processes that are not managed by the Global Services process. The category of managed processes covers those that are created as a result of the code the user runs. This could be because the user creates a process explicitly (such as instantiating a multiprocessing.Process), implicitly (such as instantiating a multiprocessing.Pool), or as a result of creating a managed data structure for higher level communication. Managed processes are always started by the Shepherd (see Process Creation and Interaction) and are handed a set of Launch Parameters as environment variables to define the Dragon environment.

Low-level Components

All Dragon processes (managed and unmanaged) are POSIX live processes using Dragon’s Managed Memory API to share thread-safe memory allocations. During an allocation of managed memory from a memory pool, an opaque memory handle (descriptor) is created by the runtime and handed to the calling process. It can then be shared with any other Dragon process to attach to the memory pool and use the underlying object (e.g. channel). The runtime takes care of proper address translation between processes by storing only the offset from shared memory base pointer. Thread-safety of underlying the memory object is ensured by using Dragons Locks.

The Dragon infrastructure uses the following Low-level Components:

Locks: High performance locks to protect Managed Memory pools.
Managed Memory: Thread-safe memory pools holding their own state so they can be shared among processes using opaque handles.
Unordered Map: A hash table implementation in Managed Memory.
Broadcast Component: Any to many broadcaster to trigger events for a collection of processes flexibly.