dragon.native.process_group

The Dragon native class for managing the life-cycle of a group of Dragon processes.

The intent is for the class to be agnostic about what the processes are doing; it only maintains their lifecycle.

This file implements a client API class and a Manager process handling all groups on the node. The manager holds a list of ProcessGroupState classes and gets signals from the client classes using a queue. The group of processes undergoes state transitions depending on the signals the client sends to the manager.

Due to transparency constraints, the manager cannot send information back to the client. Instead, the client reads the group's state from shared state.
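The one-way communication pattern described above can be illustrated with a toy model built only from the Python standard library. This is a sketch, not Dragon's implementation. It assumes a thread standing in for the manager process, a plain dict standing in for the shared state, and a hypothetical signal-to-status mapping. The manager never replies to the client: the client only observes the shared state change.

```python
import queue
import threading
import time

# Toy stand-ins; none of these names come from Dragon.
signals = queue.Queue()            # client -> manager, one direction only
state_lock = threading.Lock()
shared_state = {"status": "Idle", "version": 0}

# Hypothetical mapping from signal to resulting status, for this toy only.
TRANSITIONS = {"start": "Running", "kill": "Idle", "stop": "Stop"}


def manager_loop():
    """Consume signals and record each resulting status in shared state."""
    while True:
        sig = signals.get()
        with state_lock:
            shared_state["status"] = TRANSITIONS[sig]
            shared_state["version"] += 1
        if sig == "stop":
            return


def send_and_observe(sig, timeout=5.0):
    """Client side: enqueue a signal, then poll shared state for the change."""
    with state_lock:
        seen = shared_state["version"]
    signals.put(sig)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with state_lock:
            if shared_state["version"] > seen:
                return shared_state["status"]
        time.sleep(0.001)
    raise TimeoutError(f"no state change observed after {sig!r}")


manager = threading.Thread(target=manager_loop, daemon=True)
manager.start()
observed = [send_and_observe(s) for s in ("start", "kill", "start", "stop")]
manager.join(timeout=5.0)
print(observed)
```

The version counter lets the client tell a fresh state from a stale one without the manager sending any message back.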

The underlying state machine looks like this:

[Figure: state machine of the process group]

Classes

`Manager`
`ProcessGroup`	Robustly manage the lifecycle of a group of Dragon processes using Dragon Global Services.

class ProcessGroup

Robustly manage the lifecycle of a group of Dragon processes using Dragon Global Services.

This is really a state machine of the underlying processes. We should always be able to “ask” the manager process for the state of the group and send it a signal to make a state transition.

__init__(restart: bool = True, ignore_error_on_exit: bool = False, pmi_enabled: bool = False, walltime: Optional[float] = None, policy: Optional[Policy] = None)

Instantiate a number of Dragon processes.

Parameters

restart (bool) – if True, restart worker processes that exit unexpectedly and suppress any errors from them, defaults to True.
ignore_error_on_exit (bool) – whether to ignore errors coming from processes when they exit from the Join state, defaults to False.
flags (Worker.Flags) – optional flags that affect the handling of a worker process.
pmi_enabled (bool) – instruct the runtime to set up the environment so that the binary can use MPI for inter-process communication, defaults to False.
walltime (float) – time in seconds until the processes in the group get killed, defaults to None.
policy (dragon.infrastructure.policy.Policy) – determines the placement of the processes

add_process(nproc: int, template: ProcessTemplate) → None

Add processes to the ProcessGroup.

Parameters

template (dragon.native.process.ProcessTemplate) – a single process template, i.e. an unstarted process object
nproc (int) – number of Dragon processes to start that follow the provided template

init() → None: Initialize the ProcessGroupState and Manager.

start() → None: Starts up all processes according to the templates. If restart == False, transition to ‘Running’, otherwise transition to ‘Maintain’.
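A minimal sketch of the whole life-cycle, using only the methods documented on this page, might look as follows. The ProcessTemplate constructor is not described in this section, so the `target=` keyword is an assumption, and `hello` and `run_hello_group` are hypothetical names chosen for illustration. The Dragon imports sit inside the function so that the module can be loaded where Dragon is not installed.

```python
def hello():
    """Hypothetical worker function; any callable would do."""
    print("hello from a worker")


def run_hello_group(nproc=4):
    """Start nproc copies of the worker, wait for them, then tear down."""
    # Local imports: these require a Dragon installation.
    from dragon.native.process import ProcessTemplate
    from dragon.native.process_group import ProcessGroup

    # restart=False: workers are not restarted, so start() should
    # transition the group to 'Running' rather than 'Maintain'.
    group = ProcessGroup(restart=False)

    # Assumption: ProcessTemplate accepts the callable via `target=`.
    group.add_process(nproc=nproc, template=ProcessTemplate(target=hello))

    group.init()    # initialize the ProcessGroupState and Manager
    group.start()   # start all processes from the templates
    group.join()    # wait for completion; the group returns to 'Idle'
    group.stop()    # end of the group's life-cycle
```

Calling `run_hello_group()` only makes sense inside a running Dragon runtime. The sketch deliberately leaves out error handling.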

join(timeout: Optional[float] = None, save_puids: bool = False) → None

Wait for all processes to complete and the group to transition to Idle state. If the group status is ‘Maintain’, transition to ‘Running’.

Raises TimeoutError, if the timeout occurred.

Parameters: timeout (float) – Timeout in seconds, optional defaults to None
Returns: True if the timeout occurred, False otherwise
Return type: bool

kill(signal: Signals = Signals.SIGKILL, save_puids: bool = False) → None

Send a signal to each process of the process group.

The signals SIGKILL and SIGTERM have the following side effects:

If the signal is SIGKILL, the group will transition to ‘Idle’. It can then be reused.
If the group status is ‘Maintain’, SIGTERM will transition it to ‘Running’.
If the group status is ‘Error’, SIGTERM will raise a DragonProcessGroupError.

Parameters: signal (signal.Signals, optional) – the signal to send, defaults to signal.SIGKILL
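A common pattern is to join with a timeout and fall back to SIGKILL if the workers do not finish. The helper below is a sketch under stated assumptions: `join_or_kill` and `StubGroup` are hypothetical names, and the helper accepts any object exposing `join(timeout=...)` and `kill(signal=...)` as documented here. Because this page mentions both a raised TimeoutError and a boolean return value for join(), the helper tolerates either behaviour.

```python
import signal


def join_or_kill(group, timeout):
    """Join the group; if the timeout expires, send SIGKILL instead.

    Returns True if the group finished in time, False if it was killed.
    """
    try:
        # Treat a truthy return value as "timeout occurred".
        timed_out = bool(group.join(timeout=timeout))
    except TimeoutError:
        timed_out = True
    if timed_out:
        # Per the kill() docs, SIGKILL moves the group to 'Idle'.
        group.kill(signal=signal.SIGKILL)
    return not timed_out


class StubGroup:
    """Hypothetical stand-in for a process group, for demonstration only."""

    def __init__(self, finishes_in_time):
        self.finishes_in_time = finishes_in_time
        self.killed_with = None

    def join(self, timeout=None):
        if not self.finishes_in_time:
            raise TimeoutError("workers still running")

    def kill(self, signal=signal.SIGKILL):
        self.killed_with = signal


slow = StubGroup(finishes_in_time=False)
print(join_or_kill(slow, timeout=0.1), slow.killed_with)
```

After a SIGKILL the group is documented as reusable, so the caller can either start it again or call stop() to end its life-cycle.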

stop(save_puids=False) → None: Forcibly terminate all workers by sending SIGKILL from any state, transition to Stop. This also removes the group from the manager process and marks the end of the group life-cycle.
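The transitions described for start(), join(), kill() and stop() can be collected in a small toy model. It only tracks a status string and does not start or signal any processes. `ToyGroup` and `ToyGroupError` are hypothetical names, and the initial status 'Idle' is an assumption. It is a reading aid for the rules on this page, not a description of how Dragon implements them.

```python
import signal


class ToyGroupError(RuntimeError):
    """Stand-in for DragonProcessGroupError in this toy model."""


class ToyGroup:
    """Tracks only the status string of a group."""

    def __init__(self, restart=True):
        self.restart = restart
        self.status = "Idle"  # assumed initial state

    def start(self):
        # restart == False -> 'Running', otherwise -> 'Maintain'
        self.status = "Maintain" if self.restart else "Running"

    def join(self):
        if self.status == "Maintain":
            self.status = "Running"
        # Real workers would be awaited here; the toy finishes at once.
        self.status = "Idle"

    def kill(self, sig=signal.SIGKILL):
        if sig == signal.SIGKILL:
            self.status = "Idle"
        elif sig == signal.SIGTERM:
            if self.status == "Error":
                raise ToyGroupError("SIGTERM sent to a group in 'Error'")
            if self.status == "Maintain":
                self.status = "Running"

    def stop(self):
        # Reachable from any state; ends the life-cycle.
        self.status = "Stop"


group = ToyGroup(restart=True)
trace = [group.status]
for step in (group.start, lambda: group.kill(signal.SIGTERM), group.join, group.stop):
    step()
    trace.append(group.status)
print(trace)
```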

property puids: list[int]

Return the puids of the processes contained in this group.

Returns: a list of puids
Return type: list[int]

property inactive_puids: List[Tuple[int, int]]

Return the puids of the group’s processes that have exited, together with their exit codes.

Returns: a list of tuples (puid, exit_code)
Return type: List[Tuple[int, int]]

property exit_status: List[Tuple[int, int]]

Return the puids of the group’s processes that have exited, together with their exit codes.

Returns: a list of tuples (puid, exit_code)
Return type: List[Tuple[int, int]]
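Since the property returns plain (puid, exit_code) tuples, post-processing needs nothing beyond ordinary Python. The sketch below filters out the processes that exited with a non-zero code. `failed_processes` is a hypothetical helper, the sample puids and exit codes are invented for illustration, and treating 0 as success follows the usual POSIX convention rather than anything stated on this page.

```python
from typing import List, Tuple


def failed_processes(exit_status: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the (puid, exit_code) pairs whose exit code is non-zero."""
    return [(puid, code) for puid, code in exit_status if code != 0]


# Invented sample data in the documented shape.
sample = [(4101, 0), (4102, 1), (4103, 0), (4104, -9)]
print(failed_processes(sample))
```

With a real group, the argument would be the value of the property read after the workers have exited.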

property status: str

Get the current status of the process group handled by this instance.

Returns: current status of the group
Return type: str