Dragon Core

The Dragon Core API supports much of the rest of the Dragon run-time services. It can also be programmed to directly by knowledgable programmers. The API has at its core two types of objects, shared memory allocations and channels. Written in C, the API is interoperable with other languages, including C, C++, Python, and Fortran (C++ and Fortran bindings are currently not available as of this writing).

The following sections provide an overview of the major components of the Dragon Core. The sections following this introduction provide a detailed description of the API and objects that are a part of the Dragon Core.

Managed Memory

../../_images/managed_memory.png

Fig. 20 Conceptual View of Managed Memory

As depicted in Fig. 20, shared memory allocations come from memory pools. Both memory pools and memory pool allocations are part of the shared memory API. POSIX shared memory provides the underlying mechanism through which managed memory is implemented. Managed memory pools consist of two parts, a manifest and a data. The manifest is where all meta-data about the managed memory pool is stored. The data part is where allocations are made. The data part is managed by a heap manager whose state is maintained in the manifest. The data is split up into segments which are the smallest allocatable block of data within the data part. The size of the memory pool, size of segments, and number of segments are all configurable at run-time.

The heap manager, a sub-component of managed memory, allocates data from the pool by allocating one or more contiguous segments that together are as big or bigger than a requested allocation. Should the managed memory pool be full or unable to make an allocation of a specific size, but may be able to in the future once other allocations are freed, allocation requests may block the requesting process until the allocation is available. Processes decide whether to use blocking allocation calls or not and decide on the amount of time they are willing to wait.

Managed memory allocation is thread and interprocess safe. Interprocess locking guarantees that critical sections are executed without interleaving between processes. Managed memory is an interprocess and thread-safe dynamic memory allocation framework.

To share a managed memory allocation a serialized descriptor can be created and shared. Other processes which receive a copy of the serialized descriptor can then attach to the same shared memory within their own process. This concept of serializing an object to be shared with other processes that can then attach to it is widespread within Dragon. Managed memory, channels, and things that are built from these fundamental building blocks all can be shared by creating a serialized representation that can then be attached in other processes.

Managed memory allocations can be attached, even off-node. However, memory allocations are hosted in shared memory of a single node. Attaching to a managed memory allocation is different than getting a pointer into the data. Managed memory descriptors can be passed around at will in the Dragon run-time and between Dragon managed processes. But, a process must be running on the node where the pool exists to obtain a pointer into a managed memory allocation.

When the Dragon run-time is started, two managed memory pools are created per node to facilitate communication within the run-time. An infrastructure pool is created that is used strictly by infrastructure supporting code internal to Dragon. A default pool is also allocated which is used by APIs when making internal allocations on behalf of a user.

Channels

As shown in Fig. 20, a channel, at its most fundamental level, is an allocation in a managed memory pool. Channels only exist inside managed memory pools. All channels are allocated out of some managed memory pool. Internally, as shown at a very high level in Fig. 21 a channel is composed of finite number of blocks. Both the number of blocks and the block size are configurable at run-time.

../../_images/channel.png

Fig. 21 Conceptual View of a Channel’s Structure

The channels API provides an organized means of synchronizing and communicating between processes. Channels provides a queue-like interface between these processes both on-node and off-node (when used in conjunction with a transport service). Channels are a thread and interprocess-safe queue where messages can be sent and received.

A channel, like managed memory, can be serialized and shared with other processes. A channel, like managed memory, can also be attached both on-node and off-node. The channels library itself recognizes when a channel is located off-node and coordinates with a gateway channel which in turn is serviced by a transport service to provide communication from a process that is located on a node different from the channel with which it is communicating.

Messages are sent into channels and are captured inside a channel block. Each block either contains a message or contains a reference to a message if the message is too big to be contained in the block. When a message is sent, an unallocated block is allocated and added to the list of allocated blocks to be dequeued in sequential order. When a block is unavailable, a process may wish to block until a block is available to send its message.

Channels provide both blocking and non-blocking semantics for sending and getting messages. When sending, blocking can be chosen to wait when memory from a pool is needed and not available. When getting messages, blocking receives will wait until a message is available. All blocking operations within Dragon can either idle wait, consuming fewer resources and energy, or spin wait, with relatively no wakeup cycles. Channels are highly configurable and customizable for whatever situation they are used in at the expense of being a low-level primitive synchronization and communication construct.

When a message is dequeued the relevant message payload is copied from the block and its corresponding block is added to the list of unallocated blocks to be used again. In this way a channel’s blocks are re-used during its lifetime.

During the enqueue operation (called send_msg in the API) message data is copied into a block and during the dequeue operation (called get_msg in the API) message data is copied out of the block. For large messages the contents of a block is a serialized descriptor of the message and therefore the actual message contents may not be copied but rather shared in a controlled manor between processes providing the potential for a zero copy transfer of data between processes when both processes are located on-node.

Out of these basic concepts all other, higher level objects are constructed. Both channels and managed memory also rely on even more primitive sub-components to do their work. These sub-components provide locking, blocking, heap management, and various other services that are needed by these two fundamental building blocks, but managed memory and channels are the fundamental building blocks of all higher level objects like the File Like Interface described in the next section.

Further Information

Dragon Channels is the low-level transport layer for communicating messages between POSIX processes in the Dragon runtime. The Dragon run-time services itself uses channels for communication between processes. User programs, utilizing the Dragon run-time services also use channels either directly, or indirectly. This API would be used directly when a program wants to communicate in the most efficient way possible while being willing to give up the services provided by higher level communication and synchronization abstractions like FLIs and even higher level abstractions like queues. Choosing to communicate at the channels level will mean giving up features like pickling and unpickling, automatic serialization of data, and the automatic streaming of data over a channel.

A major advantage of Channels is that they retain the flexibility of using sockets or a filesystem while enabling zero-copy on-node transfers, single-copy RDMA-enabled transfers off-node, and choices for blocking semantics. There is a rich set of buffer management options that enable use cases such as:

  • static target-side managed payload landing pads much like SHMEM or one-sided MPI

  • dynamic target-side managed payload landing pads much like two-sided MPI

  • static or dynamic origin-side managed payload landing pad, which nothing else has

Dragon Channels can reside in any valid dragonMemoryDescr_t as provided by Dragon Memory Pools. This includes shared memory, a filesystem, or private virtual memory. Payload buffers for messages can reside within a channel, a memory pool dragonMemoryPoolDescr_t the channel was allocated from, or any valid dragonMemoryDescr_t passed with a message. This design allows Channels to provide multiple usage scenarios with different performance and persistence characteristics.

Example

An example can be found here. The example illustrates how to use the C Channels API.

Channels provide flexible on-node and communication that processes use by attaching and detaching to the underlying Managed Memory. When Channels are used in conjunction with a transport agent, transparent off-node communication is also provided when sending and receiving messages using the Channels API. In this respect, Channels resemble POSIX sockets as they are always available as a service and not built upon the static distributed model that MPI or SHMEM is. Transparency is provided because the exact same Channels API calls work for both on-node and off-node communication. The user program does not change when communicating off-node or on-node.

File Like Interface

The File Like Interface (i.e. FLI) provides an abstraction over channels of 1:1 communication between two end points. The abstraction supports streaming data over a connection like interface. Sending and receiving data is done by opening a send or receive handle. Once a send and a receive handle is opened, a 1:1 connection exists between sender and receiver. Data can then be streamed over the connection with a guarantee that data from other senders and receivers will not interrrupt the stream between the two end points.

There is no restriction about the location of the two end points. They may exist in different processes on different nodes using channels that may be co-located on the sending or receiving side, but are not required to be. There are certain benefits to co-location, but no requirements for it in the FLI API.

../../_images/fli_main.png

Fig. 22 File Like Interface Overview

The FLI is implemented as a channel of stream channels. A Stream Channel is a channel that is designated for one sender and one receiver to share while carrying on a streaming conversation. When the conversation is over, the stream channel is recycled according to the FLI protocol.

When an FLI is created, a Main Channel and a Manager Channel are created in the nominal case. When the FLI is created, there are some number of stream channels that are supplied. These stream channels are serialized and their serialized descriptors are placed in the manager channel. The manager channel is then a channel of channel descriptors.

When a Sender comes along, it opens a send handle on the FLI which then receives a serialized channel descriptor for a stream channel from the manager channel. A sender looks for stream channels in the manager channel and attempts to dequeue it when the send handle is opened.

At the moment the send handle is opened, the serialized representation of it is then written into the Main Channel. This means that as soon as there is a sender, the stream channel becomes available to a receiver to begin receiving streaming data. A sender and receiver can be carrying on their 1:1 conversation as soon as a sender is opens a send handle.

When a Receiver opens a receive handle, the FLI API receives a stream channel from the main channel and adds it to the receive handle. The receiver can begin receiving data on the stream channel at that point.

When a sender closes a send handle a special message is sent to the receiving side to indicate that the conversation is over. This results in the receiver receiving a return code of DRAGON_EOT to indicate that the end of the conversation has been reached.

Once the receiver has received DRAGON_EOT it must respond by closing the receive handle. When the receive handle is closed, the stream channel is recycled by enqueuing a serialized descriptor for it back into the Manager Channel.

Controlling the Placement of Data

One nuance of the FLI design can help in the placement of data within a pool. When data is sent to a channel, either on-node or off-node, the data must be copied into a managed memory pool while it is transferred through the stream channel.

In the absence of further information about a pool from which to allocate space, the pool of the channel being sent to will be used. If the user application wishes to optimally handle data by placing it in a pool and minimizing copies of that data, then creating the stream channels from the same pool will mean that the FLI will automatically place data that was sent to a stream channel into the same pool as the stream channel and therefore into a pool chosen by the user. The net result of this rather complicated explanation is that if the receiver creates the FLI then it may be beneficial to first create a pool and stream channels from that same pool to optimally minimize the copying of data.

Example

Here is some sample code for creating an FLI using only the C interface. This example code was taken from the test/channel_subtests/test_fli.c program. Note that for this to be useful production code, resources should be created using a means of guaranteeing unique CUIDs for channels and if needed, unique MUIDs for pools. For instance, you might use the dragon_create_process_local_channel API call to create guaranteed unique CUIDs for you channels.

Listing 35 Creating an FLI
 1#include <dragon/channels.h>
 2#include <dragon/fli.h>
 3
 4#define M_UID 0
 5#define POOL_M_UID 2
 6#define POOL "fli_test"
 7#define NUM_CHANNELS 10
 8#define MAX_STREAM_SIZE 500
 9
10dragonError_t create_pool(dragonMemoryPoolDescr_t* mpool) {
11    /* Create a memory pool to allocate messages and a Channel out of */
12    size_t mem_size = 1UL<<31;
13
14    dragonError_t err = dragon_memory_pool_create(mpool, mem_size, POOL, POOL_M_UID, NULL);
15    if (err != DRAGON_SUCCESS)
16        err_fail(err, "Failed to create memory pool");
17
18    return DRAGON_SUCCESS;
19}
20
21dragonError_t create_channels(dragonMemoryPoolDescr_t* mpool, dragonChannelDescr_t channel[], int arr_size) {
22    int k;
23    dragonError_t err;
24
25    for (k=0;k<arr_size;k++) {
26        /* Create the Channel in the memory pool */
27        err = dragon_channel_create(&channel[k], k, mpool, NULL);
28        if (err != DRAGON_SUCCESS)
29            err_fail(err, "Failed to create a channel");
30    }
31
32    return DRAGON_SUCCESS;
33}
34
35
36int main() {
37    dragonError_t err;
38    dragonMemoryPoolDescr_t pool;
39    dragonChannelDescr_t channels[NUM_CHANNELS];
40    dragonChannelDescr_t* channel_ptrs[NUM_CHANNELS];
41    dragonFLIDescr_t fli;
42    dragonFLISerial_t ser;
43
44    /* Creating the channel set and the pool */
45    for (int k=0;k<NUM_CHANNELS;k++)
46        channel_ptrs[k] = &channels[k];
47
48    /* Always check error codes - omitted for brevity */
49    err = create_pool(&pool);
50    err = create_channels(&pool, channels, NUM_CHANNELS);
51    err = dragon_fli_create(&fli, channel_ptrs[0], channel_ptrs[1], &pool, NUM_CHANNELS-2, &channel_ptrs[2], false, NULL);
52    err = dragon_fli_serialize(&fli, &ser);
53
54    /* The serialized fli named ser can then be base64 encoded and shared
55       with other processes where it can then be base64 decoded
56       and attached to other processes within the dragon run-time. */
57
58    ...
59}

Variations on Creation of FLIs

There are several variations possible on the creation of FLI connections. The following sections provide a comprehensive list of these possible variations. In summary, the variations are:

  • Buffered FLI

  • Sender Supplied Stream Channels

  • Receiver Supplied Stream Channels

  • Main Channel as Stream Channel

These variations are all different from the nominal case presented in the previous section.

Buffered FLI

On creation, it is possible to create an FLI that is designated as buffered. This is done as follows.

Listing 36 A Buffered FLI
 1int main() {
 2    dragonError_t err;
 3    dragonChannelDescr_t channels[NUM_CHANNELS];
 4    dragonChannelDescr_t* channel_ptrs[NUM_CHANNELS];
 5    dragonFLIDescr_t fli;
 6
 7    for (int k=0;k<NUM_CHANNELS;k++)
 8        channel_ptrs[k] = &channels[k];
 9    /* Always check error codes - omitted for brevity */
10    err = create_pool(&pool);
11    err = create_channels(&pool, channels, NUM_CHANNELS);
12    err = dragon_fli_create(&fli, channel_ptrs[0], NULL, &pool, 0, NULL, true, NULL);

In Listing 36 true is passed for the value of the use_buffered_protocol argument. Also in this case there are no specified stream channels and no manager channel. In this case, while the FLI continues to operate using the normal FLI API, all writes to a send handle are buffered until the send handle is closed. This means that no streaming occurs in this case, but one channel, the main channel, can be used by multiple senders and receivers and conversations between opening and closing a receive handle will consist of one receive operation on the underlying channel. However, there may be multiple read operations performed on the received chunk of data when less than the whole chunk is being read. For instance, the buffered send my send 128 bytes of data. A reader may read 32 bytes resulting in 96 bytes of data left to read. Once the received data is exhausted by read operations, DRAGON_EOT will be returned.

To reiterate, with a buffered FLI, multiple writes are possible and multiple reads are possible, but the underlying data is all sent in one buffered chunk. By sending all the conversations data within one buffered chunk, multiple conversations can peacefully co-exist within the same channel.

Sender Supplied Stream Channels

An application designer may decide that senders should all have their own dedicated stream channels. In the nominal case, senders pick up a stream channel from the manager channel. If the design of the application is to have each sender have its own dedicated stream channel, then a manager channel is not necessary.

../../_images/fli_sender.png

Fig. 23 File Like Interface with Sender Supplied Stream Channels

Listing 37 demonstrates creating an FLI with the intention of providing sender supplied stream channels. This is allowed because it is created with a main channel. No manager channel is required in this case.

Listing 37 Creating a Send Handle with a Sender Supplied Stream Channel
 1int main() {
 2    dragonError_t err;
 3    dragonChannelDescr_t channels[NUM_CHANNELS];
 4    dragonChannelDescr_t* channel_ptrs[NUM_CHANNELS];
 5    dragonFLIDescr_t fli;
 6    dragonFLISendHandleDescr_t sendh;
 7
 8    for (int k=0;k<NUM_CHANNELS;k++)
 9        channel_ptrs[k] = &channels[k];
10    /* Always check error codes - omitted for brevity */
11    err = create_pool(&pool);
12    err = create_channels(&pool, channels, NUM_CHANNELS);
13    err = dragon_fli_create(&fli, channel_ptrs[0], NULL, &pool, 0, NULL, false, NULL);
14    err = dragon_fli_open_send_handle(&fli, &sendh, channel_ptrs[1], NULL);

When opening the send handle, a sender supplied stream channel is specified. Opening the send handle places the stream channel’s serialized descriptor into the main channel to be picked up by a receiver. NOTE: While both sender supplied stream channels and receiver supplied stream channels are possible, one or the other must be chosen and adhered to in a particular FLI. If stream channels are sender supplied, the receivers should not specify a stream channel when opening a receive handle. Doing so would mean that senders and receivers would not talk to each other.

Receiver Supplied Stream Channels

In some cases it may be desirable for each receiver to have a dedicated stream channel. In the nominal case, receivers pick up a stream channel from the main channel. If the design of the application is to have each receiver have its own dedicated stream channel, then a main channel is not necessary.

../../_images/fli_receiver.png

Fig. 24 File Like Interface with Receiver Supplied Stream Channels

Listing 38 demonstrates creating an FLI with the intention of providing receiver supplied stream channels. This is allowed because it is created with a manager channel. No main channel is required in this case.

Listing 38 Creating a Receive Handle with a Receiver Supplied Stream Channel
 1int main() {
 2    dragonError_t err;
 3    dragonChannelDescr_t channels[NUM_CHANNELS];
 4    dragonChannelDescr_t* channel_ptrs[NUM_CHANNELS];
 5    dragonFLIDescr_t fli;
 6    dragonFLIRecvHandleDescr_t recvh;
 7
 8    for (int k=0;k<NUM_CHANNELS;k++)
 9        channel_ptrs[k] = &channels[k];
10    /* Always check error codes - omitted for brevity */
11    err = create_pool(&pool);
12    err = create_channels(&pool, channels, NUM_CHANNELS);
13    err = dragon_fli_create(&fli, NULL, channel_ptrs[0], &pool, 0, NULL, false, NULL);
14    err = dragon_fli_open_recv_handle(&fli, &recvh, channel_ptrs[1], NULL);

When opening the receive handle, a receiver supplied stream channel is specified. Opening the receive handle places the stream channel’s serialized descriptor into the manager channel to be picked up by a sender. NOTE: While both sender supplied stream channels and receiver supplied stream channels are possible, one or the other must be chosen and adhered to in a particular FLI. If stream channels are receiver supplied, the senders should not specify a stream channel when opening a send handle. Doing so would mean that senders and receivers would not talk to each other.

When a sender opens a send handle in this case, the API dequeues from the manager channel and uses the receiver supplied stream channel found there. The user-supplied stream channel is not returned to a main channel in this case since the receiver is already receiving from its dedicated channel.

Main Channel as Stream Channel

In certain situations it may be known that there is only one sender and one receiver that are going to use an FLI interface. If that is the case, then management of the FLI is not essential to managing the 1:1 conversation, but it may be desirable to use the FLI interface anyway. In such situations, the main channel can be used as a stream channel so data may be streamed between sender and receiver without the need for extra channels.

Listing 39 Creating an FLI for One Sender and One Receiver
 1int main() {
 2    dragonError_t err;
 3    dragonChannelDescr_t channels[NUM_CHANNELS];
 4    dragonChannelDescr_t* channel_ptrs[NUM_CHANNELS];
 5    dragonFLIDescr_t fli;
 6    dragonFLISendHandleDescr_t sendh;
 7    dragonFLIRecvHandleDescr_t recvh;
 8
 9    for (int k=0;k<NUM_CHANNELS;k++)
10        channel_ptrs[k] = &channels[k];
11    /* Always check error codes - omitted for brevity */
12    err = create_pool(&pool);
13    err = create_channels(&pool, channels, NUM_CHANNELS);
14    err = dragon_fli_create(&fli, channel_ptrs[0], NULL, &pool, 0, NULL, false, NULL);
15    err = dragon_fli_open_send_handle(&fli, &sendh, STREAM_CHANNEL_IS_MAIN_FOR_1_1_CONNECTION, NULL);
16    /* and separately on the receiver */
17    err = dragon_fli_open_recv_handle(&fli, &recvh, STREAM_CHANNEL_IS_MAIN_FOR_1_1_CONNECTION, NULL);

The code in Listing 39 shows that both the sender and receiver must declare that there is one receiver and one sender by specifying the constant STREAM_CHANNEL_IS_MAIN_FOR_1_1_CONNECTION when the send and receive handles are opened.

Sending and Receiving Data

Once a send handle is opened, there are several ways to send data over the send handle. You can send bytes or you can send managed memory directly. You can also open a file descriptor over a send handle and use the file descriptor by writing to it or passing it to existing code that wishes to write to a file descriptor.

When a receive handle is opened, there are even more options for reading the data. You can read bytes or you can read bytes into an existing location. You can also user the lower-level receive of managed memory objects. For the two read bytes options the FLI manages read operations by not returning more data than was requested - while less may be returned when data is finally exhausted. The FLI will buffer unread data until it is read on subsequent read operations until the data is exhausted.

Like the writing side, the reading of data can also be done via a file descriptor. On both the reading and writing side, once the file descriptor writing and reading is done, the appropriate file descriptor finalize function should be called to guarantee that the file descriptor is flushed and cleaned up correctly.

See the API section on reading and writing data for more detailed discussions on how to use this part of the API.

User Supplied Arguments

When writing to an FLI it is possible to specify an extra argument which is a 64-bit user-supplied value that exists as meta-data for the write operation and not a part of the data itself. This meta-data is received by the receiving side and can be used by an application as needed. This is not unlike certain POSIX calls that allow a user supplied argument to be passed, but for the FLI it is limited to 64-bits and should not be a pointer unless sender and receiver are both in the same process (since address spaces are not typically shared between processes and further since Dragon is a distributed run-time where shared pointers would not always make any sense since they don’t necessarily even have a common shared memory).

For more information, consult the arg argument in the write and read operations.

C Reference

Python Reference