Welcome to Project Dragon!

Overview

Dragon is a composable distributed run-time for managing dynamic processes, memory, and data at scale through high-performance communication objects that work transparently from laptop to supercomputer. Some of the key use cases for Dragon include distributed applications and workflows for analytics, HPC, and converged HPC/AI.

Dragon has a core, which consists of run-time services and foundational memory and communication primitives. Other Dragon components build on top of the core. Composability among the components makes Dragon a flexible ecosystem in which to build scalable solutions across a wide range of applications and workflows.

Dragon development is organized into several areas of ongoing work, shown in Table 1. These areas reflect needs that are common across distributed computing at scale.

Table 1 Dragon Components

Name                    Use Cases                                                                       Status
----------------------  ------------------------------------------------------------------------------  ---------
Core                    Complete control over processes, memory, and communication                      Ready
Python multiprocessing  Scale any Python multiprocessing program across nodes with little effort        Ready
Workflow Extensions     Higher-level interface for defining workflows and adapters into workflow tools  Alpha
Data                    Workloads needing a distributed KVS or mid-level cache                          Beta
Telemetry               Introspection of Dragon, hardware utilization, user-injected real-time data     pre-Alpha
AI                      Scalable data loaders for AI training, distributed training and resiliency      Alpha

A few notes on Table 1 help provide context:

  • The Dragon core is aimed at advanced users. Its APIs may change slightly, but the code is mature and has been tested and used extensively.

  • Nearly all of Python multiprocessing becomes multi-node with Dragon’s run-time.

  • Requirements and use cases for dynamic workflow extensions are still being gathered and investigated.

  • A distributed dictionary or Key/Value Store (KVS) is ready for beta use. Performance improvements will continue.

  • Telemetry work will help to provide feedback to users on performance and potential optimization of user code.

  • Dragon developers are working independently and with library developers to provide integration with AI libraries, testing, and performance at scale.
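As an illustration of the Python multiprocessing entry in Table 1, the following is a minimal sketch of a standard multiprocessing program that opts into Dragon when it is available. It assumes that importing the dragon package registers a "dragon" start method and that the script is launched through Dragon's own launcher; confirm both against Getting Started. The names square and sum_of_squares are hypothetical and chosen only for this example.

```python
import multiprocessing as mp

try:
    # Assumption: importing dragon makes the "dragon" start method available.
    import dragon  # noqa: F401
    HAVE_DRAGON = True
except ImportError:
    HAVE_DRAGON = False


def square(x):
    """Work performed by each pool worker."""
    return x * x


def sum_of_squares(values, workers=4):
    """Fan the work out to a pool of processes and combine the results."""
    with mp.Pool(workers) as pool:
        return sum(pool.map(square, values))


if __name__ == "__main__":
    if HAVE_DRAGON:
        # The only Dragon-specific line; everything else is plain multiprocessing.
        mp.set_start_method("dragon")
    print(sum_of_squares(range(10)))
```

Without Dragon installed, the same file runs on a single node with the default start method, which is the point of the sketch: the program logic does not change when the run-time underneath it does.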

What distinguishes Dragon from many other projects is that it has been designed from the ground up as a powerful run-time for distributed computing at scale, and work continues to make it even more useful. After more than four years of design and development, we are happy to share it with the open source community. We are grateful to Hewlett Packard Enterprise for supporting what we see as an open source project that helps fulfill a critical need in AI and high performance computing.

How Dragon Works

Dragon consists of a low-level set of core interfaces and high-level interfaces that are composed from the low-level components. Dragon’s run-time services manage the life-cycles of instances of core components, perform inter-node communication on behalf of user processes, and manage the deployment of the run-time across the nodes of a distributed system.

Dragon’s most basic components are Dragon Managed Memory and Dragon Channels, as depicted in Fig. 1. Each Dragon Channel uses allocations from Managed Memory and behaves much like a FIFO queue. All communication objects, both user and infrastructure, are based on Dragon Channels. Intra-server communication is done directly over shared memory, and inter-server communication is done with the assistance of a Transport Agent. User processes may have lifetimes that are independent of communication object lifetimes, and processes can use communication objects without knowing the physical location of the Channels those objects are composed from.

The final basic component of Dragon is a Process. Dragon manages the life-cycle and placement of processes on any server the run-time is executing on. Global Services is the broker for all life-cycle requests related to Managed Memory, Channels, and Processes. It negotiates with the Local Services process running on a given node, which does the work of managing a component on its node. With this basic architecture, programs written for Dragon can run transparently on a single node, on multiple nodes, or in a multi-system environment with little to no change.

Higher-level communication and synchronization objects supported by Dragon are implemented on top of these basic (i.e. core) Dragon components. User programs can take full advantage of their distributed system without having to deal with the details of location and inter-node communication. Where an object resides is functionally irrelevant because the Dragon core and transport service provide transparent, location-independent access to objects that are composed from Dragon core components.
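The location independence described above can be sketched with ordinary multiprocessing queues, which are the kind of higher-level communication object a user program works with. This is an illustrative sketch rather than Dragon's own example: it assumes that importing dragon registers a "dragon" start method, and the function names (produce, transform, collect) are hypothetical. Note that nothing in the code says where either process runs or where the queues live.

```python
import multiprocessing as mp

try:
    # Assumption: importing dragon makes the "dragon" start method available.
    import dragon  # noqa: F401
    HAVE_DRAGON = True
except ImportError:
    HAVE_DRAGON = False

STOP = None  # sentinel marking the end of the stream


def produce(outbox, items):
    """Put every item on the queue, then the sentinel."""
    for item in items:
        outbox.put(item)
    outbox.put(STOP)


def transform(inbox, outbox):
    """Read items in FIFO order until the sentinel, forwarding upper-cased copies."""
    while (item := inbox.get()) is not STOP:
        outbox.put(item.upper())
    outbox.put(STOP)


def collect(inbox):
    """Drain a queue into a list, stopping at the sentinel."""
    results = []
    while (item := inbox.get()) is not STOP:
        results.append(item)
    return results


if __name__ == "__main__":
    if HAVE_DRAGON:
        mp.set_start_method("dragon")
    raw, cooked = mp.Queue(), mp.Queue()
    stages = [
        mp.Process(target=produce, args=(raw, ["managed", "memory", "channels"])),
        mp.Process(target=transform, args=(raw, cooked)),
    ]
    for stage in stages:
        stage.start()
    # Drain the results before joining so no stage blocks on a full queue.
    print(collect(cooked))
    for stage in stages:
        stage.join()
```

The stage functions only call put and get, so they work with any queue-like object. That keeps the program logic separate from the question of which node a process or queue ends up on.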

_images/dragon_deployment.jpg

Fig. 1 Dragon Infrastructure Architecture

Dragon implements most of Python’s multiprocessing library. Managers are not implemented, and neither are listeners and clients. The sharedctypes of multiprocessing are limited to Value and Array in Dragon. All the rest of multiprocessing is implemented by Dragon.
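For shared state, a program that stays within the supported subset would rely on Value and Array rather than on managers. The sketch below uses only standard multiprocessing calls. It assumes, as in Dragon's documented usage, that importing dragon registers a "dragon" start method; the helper name record is hypothetical.

```python
import multiprocessing as mp

try:
    # Assumption: importing dragon makes the "dragon" start method available.
    import dragon  # noqa: F401
    HAVE_DRAGON = True
except ImportError:
    HAVE_DRAGON = False


def record(total, slots, index, amount):
    """Add `amount` to the shared total and store it in this worker's slot."""
    with total.get_lock():  # "+=" on a shared Value is not atomic on its own
        total.value += amount
    slots[index] = amount


if __name__ == "__main__":
    if HAVE_DRAGON:
        mp.set_start_method("dragon")
    total = mp.Value("i", 0)  # one shared C int
    slots = mp.Array("i", 4)  # four shared C ints, zero-initialized
    workers = [
        mp.Process(target=record, args=(total, slots, i, i + 1))
        for i in range(4)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print(total.value, slots[:])
```

Code that depends on multiprocessing managers, listeners, or clients would need to be restructured along these lines before moving to Dragon.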

Why Dragon?

Dragon was originally conceived to address questions developers and researchers often face as their workloads become more demanding: “Is it worth the effort to rewrite this?” or “Must I use cumbersome interfaces just to get scaling and performance?” Dragon answers them by bringing performance and scaling to the standard libraries of high-productivity languages. The Dragon run-time (a rudimentary distributed OS in user space), developed to support a scalable implementation of the standard Python multiprocessing API, can also be adapted for many other uses through the composability of its components. Distributed orchestration of dynamic processes and data through efficient communication primitives, programmed through high-level APIs like Python multiprocessing, is simpler and more productive with Dragon.

Where to Next?

Follow the directions in Getting Started to install Dragon. Then consult the Users Guide and Solution Cookbook to become familiar with using Dragon. Most users will want to install Dragon and then refer to Python multiprocessing to learn what they can do with multiprocessing and Dragon’s multi-node implementation!
