Process Creation and Interaction

Here we describe the creation and interaction of managed POSIX processes by the Dragon runtime Infrastructure Services.

Launch Parameters

The Dragon runtime has a number of parameters that may be constant or determined at job launch time. These need to be accessed in a mutually coherent way by a number of different runtime components:

Launcher, on the login node; the source of some parameters, such as the number of back end nodes

MRNet Backend, one per node

Local Services, one per node

TransportAgent, one per node

Global Services head process (plus any more it may launch)

Managed user processes

Here we specify how these parameters are to be supplied, and what their names and types are. For now, these parameters are viewed as potentially changing between separate user launches but are immutable from the point of view of the lifetime of any one component process. They may not necessarily all be identical across processes.

Providing parameters

For the time being, parameters will be provided to launched components as POSIX environment variables. This may change to another scheme in the future.

Each parameter listed below, when provided as an environment variable, will have the string DRAGON_ prepended to its name.
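As a minimal sketch of this naming rule, the helpers below map a parameter name to its environment variable name and back. The names `env_name` and `param_name` are hypothetical and are not part of the Dragon API.

```python
# Sketch only: env_name and param_name are hypothetical helpers, not Dragon API.
PREFIX = "DRAGON_"


def env_name(param: str) -> str:
    """Return the environment variable name for a parameter name."""
    return PREFIX + param


def param_name(var: str) -> str:
    """Return the parameter name for a prefixed environment variable name."""
    if not var.startswith(PREFIX):
        raise ValueError(f"{var!r} is not a Dragon launch parameter variable")
    return var[len(PREFIX):]


print(env_name("ALLOC_SZ"))
print(param_name("DRAGON_INDEX"))
```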

Accessing parameters

All code that uses parameters will do this through a defined interface and not through hard-coding them or directly accessing wherever they happen to be sourced from. This lets us change policy if need be as to where these things ultimately come from.

Parameters are accessed in Python through the dragon.infrastructure.parameters module which defines the LaunchParameters object.

This module contains the dragon.infrastructure.parameters.this_process singleton instance, which should be the only way other Python code interacts with the launch parameters.

This object is initialized, from the environment, when the parameters module is imported into a process and is meant to contain immutable data. Changes to the this_process object’s attributes do not change the process’s actual environment. Instead, the this_process object has a method that returns the environment variable name and value for each parameter in a dictionary as strings. In this way one can collect the environment variables that need to be set when starting a new process.

The LaunchParameters object can be instantiated separately from the this_process singleton. One may do a deep copy on the this_process variable, adjust what needs adjusting in the copy by changing the members directly, then ask the object to produce a dictionary of environment variables. If these variables are set for a new process, the corresponding this_process singleton object will match with what has been prepared.
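The copy-and-adjust workflow described above might look roughly like the following. This is a sketch under stated assumptions: `LaunchParametersSketch` is a hypothetical stand-in written for this example rather than the real LaunchParameters class, it models only three parameters, and the method name `env` and the lowercase attribute names are assumptions.

```python
import copy


class LaunchParametersSketch:
    """Hypothetical stand-in for LaunchParameters; names here are assumed."""

    PREFIX = "DRAGON_"
    # (parameter name, converter) for the few parameters this sketch models.
    FIELDS = (("MODE", str), ("INDEX", int), ("MY_PUID", int))

    def __init__(self, environ):
        # Read each parameter from the supplied environment mapping.
        for name, convert in self.FIELDS:
            setattr(self, name.lower(), convert(environ[self.PREFIX + name]))

    def env(self):
        """Return {environment variable name: value as a string}."""
        return {
            self.PREFIX + name: str(getattr(self, name.lower()))
            for name, _ in self.FIELDS
        }


# Stands in for dragon.infrastructure.parameters.this_process.
this_process = LaunchParametersSketch(
    {"DRAGON_MODE": "single", "DRAGON_INDEX": "0", "DRAGON_MY_PUID": "7"}
)

child = copy.deepcopy(this_process)  # the singleton itself stays untouched
child.my_puid = 8                    # adjust only what differs for the child
child_env = child.env()              # variables to set for the new process

# A process started with child_env would rebuild matching parameters.
rebuilt = LaunchParametersSketch(child_env)
```

The resulting dictionary could then be merged into the `env` argument of whatever call starts the new process, for example `subprocess.Popen`.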

FIXME: proper API documentation and cross links

FIXME: define C interface for this stuff. The keys will be (at some level) strings, but they should not be mentioned all over the code, only as const char* defined in one place. It could be as simple as initializing a struct.

Parameter organization

The general structure of parameters is key-value pairs, where the key is the name of the parameter.

Types

Parameter values can have integer or string type.

Everything will be in UTF-8 encoding.

Parameter names will be legal C99 identifiers.

Integer parameters will be provided in base 10 with no leading, trailing, or interstitial spaces, e.g. as produced by the “%d” C99 printf format specifier.

Parameters that for some reason need richer typing than integer and string will be provided as JSON encoded strings.

We will avoid having floating point parameters by defining units as part of the parameter - that is, instead of a floating point number of seconds, an integer number of Planck time units.

Individual parameters may have their own range restrictions beyond the basic type and these should be checked up front if possible when building the interface object.

For instance, an integer may be expected to be a nonnegative integer, or less than another integer.
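The typing rules above can be sketched as a pair of strict parsers. The function names `parse_int` and `parse_json` and the `minimum` argument are hypothetical, chosen for this example only. The regular expression is needed because Python's built-in `int()` is more permissive than the rule stated here: it accepts surrounding whitespace and underscores between digits.

```python
import json
import re

# Base-10 integer with an optional minus sign and no spaces anywhere.
# ASCII digits only: \d would also match non-ASCII decimal digits.
_INT_RE = re.compile(r"-?[0-9]+")


def parse_int(name, raw, minimum=None):
    """Parse an integer parameter strictly, then apply an optional lower bound."""
    if _INT_RE.fullmatch(raw) is None:
        raise ValueError(f"{name}: {raw!r} is not a plain base-10 integer")
    value = int(raw)
    if minimum is not None and value < minimum:
        raise ValueError(f"{name}: {value} is below the minimum {minimum}")
    return value


def parse_json(name, raw):
    """Decode a richer-typed parameter supplied as a JSON encoded string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"{name}: invalid JSON ({err})") from None


alloc_sz = parse_int("ALLOC_SZ", "4", minimum=1)
index = parse_int("INDEX", "2", minimum=0)
# Cross-parameter check; that INDEX must be below ALLOC_SZ is an assumption.
if index >= alloc_sz:
    raise ValueError("INDEX must be less than ALLOC_SZ")
```

Running the range checks once, when the interface object is built, means later code can rely on the values without re-validating them.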

List of parameters

This list may grow in the future. It includes values needed to bootstrap the Dragon services during startup as well as values needed by user processes that execute under the supervision of the Dragon runtime services.

FIXME Add source and sink of these variables to improve clarity

Name	Type	Description
MODE	string	set to ‘hsta’ if it is a multi-node run. Otherwise it is set to ‘single’. Used during bootstrapping.
ALLOC_SZ	integer >0	number of back end nodes in the allocation.
INDEX	integer >=0	which node in the allocation this process is on
PRIMARY_INDEX	integer >=0	which node in the allocation is designated as the primary node. Global services runs on the primary node.
DEFAULT_PD	string	default local pool descriptor for process allocations on behalf of the Dragon runtime services (base64).
INF_PD	string	infrastructure local pool descriptor. Used strictly by Dragon services (base64).
LOCAL_SHEP_CD	string	channel descriptor (base64) for the local Shepherd’s main queue.
LOCAL_BE_CD	string	channel descriptor (base64) for the launcher’s backend main queue.
GS_RET_CD	string	channel descriptor (base64) for GS responses to this process from GS API calls.
SHEP_RET_CD	string	channel descriptor (base64) for Shep responses to this process from Shepherd API calls.
GS_CD	string	channel descriptor (base64) for the Global Services’ main recv queue.
GS_MAX_MSGS	integer	The maximum number of messages allowed in the main Global Services channel.
SH_MAX_MSGS	integer	The maximum number of messages allowed in the main Shepherd channel.
DEFAULT_SEG_SZ	integer >0	size of the default shared memory segment. Used during bootstrapping.
INF_SEG_SZ	integer >0	size of the infrastructure shared memory segment. For bootstrapping.
TEST	integer >=0	as bool; indicates ‘test mode’ to bypass some initializations
DEBUG	integer >=0	integers greater than 0 indicate increased levels of logging for debug purposes. 0 indicates no logging. This is a placeholder for now.
MY_PUID	integer >0	process uid for this process
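Several of the parameters above carry binary descriptors as base64 strings so that they fit in an environment variable. The sketch below shows one way such a round trip could work. It assumes the standard base64 alphabet, which the table does not specify; the helper names are hypothetical and the descriptor bytes are an arbitrary placeholder, not a real channel descriptor.

```python
import base64


def descriptor_to_env(desc: bytes) -> str:
    """Encode descriptor bytes as an ASCII string usable in an environment variable."""
    return base64.b64encode(desc).decode("ascii")


def descriptor_from_env(value: str) -> bytes:
    """Decode a base64 parameter value, rejecting characters outside the alphabet."""
    return base64.b64decode(value, validate=True)


placeholder = bytes(range(8))  # arbitrary bytes standing in for a descriptor
env = {"DRAGON_GS_CD": descriptor_to_env(placeholder)}
recovered = descriptor_from_env(env["DRAGON_GS_CD"])
```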

Activity Diagram

The following flow diagram shows the interaction between components during process launch, output and input handling, signaling, and process termination. The text below the figure gives additional details on the activities during this interaction. Activities a1 through a6 are marked in Fig. 41, and the numbered list below describes them further.

Fig. 41 Launcher Component Interaction during Process Interaction

Activity Description

1. Starting with a1, a process is launched by the user. During process launch a GSProcessCreate message is created by the Launcher and forwarded to Global Services through its pipe connection either to MRNet or, in the case of single-node launch, directly to the Backend. The p_uid field of the GSProcessCreate message is set to 0 to indicate it is the head process. The r_c_uid field is set to the channel id of the Backend.
2. When output is created by a process, as shown at a2, the output is detected by the Shepherd, which creates an SHFwdOutput message and sends it directly to the Backend component. The Backend forwards these messages to the Launcher through MRNet (if multi-node) or directly to the Launcher (if single-node).
3. At a3 the user sends input on standard input to the Launcher. The Launcher waits for available input on stdin, then creates an SHFwdInput message and sends it to the Backend on the primary node, which in turn forwards it to the Shepherd. The Shepherd then writes that same input to stdin of the head process.
4. Activity a4 demonstrates what happens when a signal is detected. The Launcher monitors signals that occur while it is running. During process execution, the Launcher continues to run, accepting input and signals. When a signal occurs, it is forwarded in the same manner, this time through Global Services and then the Shepherd, to signal the head process. Note that the Shepherd sends the SHProcessExit in response to the kill request. In this case the message does not acknowledge that the process exited, only that the signal was sent to the process. Global Services then acknowledges that the signal was sent through the GSProcessKillResponse message. Again, this does not mean that the process exited or otherwise terminated. It indicates that the signal was received, or it provides an error code indicating what went wrong. For instance, if the process had already terminated before the signal message was received by Global Services or the Shepherd, an appropriate response code will be set. See the GSProcessKillResponse message for possible return codes to the Launcher.
5. At a5, after the process is created, the Launcher may request notification of exit by sending the GSProcessJoin message.
6. When the process exits, as indicated starting at a6, the Shepherd detects that the process is exiting and sends the SHProcessExit to Global Services, which recognizes that this was the head process and then sends the GSProcessJoinResponse message. This message is forwarded to the Launcher through the Backend. If the user then exits the Launcher command processor using the exit command and no other messages need processing, the Launcher initiates teardown of the infrastructure as shown in Single Node Teardown or Multi Node Teardown.
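The routing difference between single-node and multi-node runs in a1 and a2 can be summarized in a small sketch. The component names come from the text and the mode strings from the MODE parameter; the function names are hypothetical. The hop order for the multi-node case, with MRNet sitting between the Launcher and the Backend, is an illustrative reading of the description above, not something taken from Dragon code.

```python
# Sketch only: an illustrative reading of activities a1 and a2, not Dragon code.


def launcher_to_gs_path(mode: str) -> list:
    """Hops a GSProcessCreate message takes from the Launcher to Global Services."""
    if mode == "single":
        return ["Launcher", "Backend", "Global Services"]
    if mode == "hsta":
        # Multi-node: assumed to pass through MRNet before reaching the Backend.
        return ["Launcher", "MRNet", "Backend", "Global Services"]
    raise ValueError(f"unknown MODE {mode!r}")


def output_path(mode: str) -> list:
    """Hops an SHFwdOutput message takes from the Shepherd to the Launcher."""
    hops = ["Shepherd", "Backend"]
    if mode == "hsta":
        hops.append("MRNet")
    elif mode != "single":
        raise ValueError(f"unknown MODE {mode!r}")
    hops.append("Launcher")
    return hops


for mode in ("single", "hsta"):
    print(mode, " -> ".join(launcher_to_gs_path(mode)))
    print(mode, " -> ".join(output_path(mode)))
```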