dragon.native.process_group

The Dragon native class for managing the life-cycle of a group of Dragon processes.

The class is intended to be agnostic about what the processes are doing; it only maintains their lifecycle.

This file implements a client API class and a Manager process that handles all groups on the node. The manager holds a list of ProcessGroupState objects and receives signals from the client classes through a queue. The group of processes undergoes state transitions depending on the signals the client sends to the manager.

Due to transparency constraints, the manager cannot send information back to the client. Instead, we use shared state.
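The signalling pattern described above can be illustrated with a toy model. This is a sketch only, not Dragon's implementation: a standard-library thread and `queue.Queue` stand in for the manager process and its queue, a lock-guarded dict stands in for the shared state, and the names `toy_manager`, `signals` and `shared_state` are hypothetical.

```python
import queue
import threading

# Toy stand-ins: a thread plays the manager, a dict plays the shared state.
signals = queue.Queue()
shared_state = {"status": "Idle"}
state_lock = threading.Lock()


def toy_manager():
    """Consume signals and publish the resulting status; never reply directly."""
    while True:
        sig = signals.get()
        if sig == "shutdown":
            break
        with state_lock:
            if sig == "start":
                shared_state["status"] = "Running"
            elif sig == "kill":
                shared_state["status"] = "Idle"


manager = threading.Thread(target=toy_manager)
manager.start()

# The client only writes to the queue ...
signals.put("start")
signals.put("kill")
signals.put("shutdown")
manager.join()

# ... and learns the outcome by reading the shared state.
with state_lock:
    final_status = shared_state["status"]
print(final_status)  # Idle
```

The point of the sketch is the direction of data flow: requests travel one way through the queue, and the client observes results by reading state rather than by waiting for a reply.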

The underlying state machine looks like this:

[Figure: state machine of the process group (dragon_worker_pool.svg)]

Classes

Manager

ProcessGroup

Robustly manage the lifecycle of a group of Dragon processes using Dragon Global Services.

class ProcessGroup

Robustly manage the lifecycle of a group of Dragon processes using Dragon Global Services.

This is really a state machine of the underlying processes. We should always be able to “ask” the manager process for the state of the group and send it a signal to make a state transition.

__init__(restart: bool = True, ignore_error_on_exit: bool = False, pmi_enabled: bool = False, walltime: Optional[float] = None, policy: Optional[Policy] = None)

Instantiate a number of Dragon processes.

Parameters
  • restart (bool) – if True, restart worker processes that exit unexpectedly and suppress any errors from them, defaults to True.

  • ignore_error_on_exit (bool) – if True, ignore errors from processes when they exit from the Join state, defaults to False.

  • flags (Worker.Flags) – optional flags that affect the handling of a worker process.

  • pmi_enabled (bool) – instruct the runtime to set up the environment so that the binary can use MPI for inter-process communication, defaults to False.

  • walltime (float) – time in seconds until the processes in the group get killed, defaults to None.

  • policy (dragon.infrastructure.policy.Policy) – determines the placement of the processes

add_process(nproc: int, template: ProcessTemplate) None

Add processes to the ProcessGroup.

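Parameters
  • nproc (int) – number of processes to add to the group

  • template (ProcessTemplate) – template describing the processes to add
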
init() None

Initialize the ProcessGroupState and Manager.

start() None

Starts up all processes according to the templates. If restart == False, transition to ‘Running’, otherwise transition to ‘Maintain’.

join(timeout: Optional[float] = None, save_puids: bool = False) None

Wait for all processes to complete and the group to transition to Idle state. If the group status is ‘Maintain’, transition to ‘Running’.

Raises TimeoutError if the timeout expires.

Parameters

timeout (float) – timeout in seconds, optional, defaults to None

Returns

True if the timeout occurred, False otherwise

Return type

bool

kill(signal: Signals = Signals.SIGKILL, save_puids: bool = False) None

Send a signal to each process of the process group.

The signals SIGKILL and SIGTERM have the following side effects:

  • If the signal is SIGKILL, the group will transition to ‘Idle’. It can then be reused.

  • If the group status is ‘Maintain’, SIGTERM will transition it to ‘Running’.

  • If the group status is ‘Error’, SIGTERM will raise a DragonProcessGroupError.

Parameters

signal (signal.Signals, optional) – the signal to send, defaults to signal.SIGKILL
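The side effects listed above can be summarised as a small pure function. This is a toy model under stated assumptions, not Dragon code: `status_after_kill` and `ToyGroupError` are hypothetical names (the latter standing in for DragonProcessGroupError), the status values are plain strings, and any case the list does not mention is assumed to leave the status unchanged.

```python
import signal


class ToyGroupError(Exception):
    """Stand-in for DragonProcessGroupError in this toy model."""


def status_after_kill(status, sig=signal.Signals.SIGKILL):
    """Return the group status after `sig` is sent, per the rules above."""
    if sig == signal.Signals.SIGKILL:
        return "Idle"  # the group can be reused afterwards
    if sig == signal.Signals.SIGTERM:
        if status == "Error":
            raise ToyGroupError("SIGTERM sent to a group in 'Error' state")
        if status == "Maintain":
            return "Running"
    # Assumption: any other combination leaves the status unchanged.
    return status


print(status_after_kill("Maintain"))                          # Idle
print(status_after_kill("Maintain", signal.Signals.SIGTERM))  # Running
```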

stop(save_puids=False) None

Forcibly terminate all workers by sending SIGKILL from any state, then transition to Stop. This also removes the group from the manager process and marks the end of the group life-cycle.
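Taken together, the methods documented so far suggest a call order along the following lines. This is a minimal sketch, not an authoritative recipe: it assumes that `ProcessTemplate` can be imported from `dragon.native.process` and accepts a Python callable as `target`, and the names `hello` and `run_group` are hypothetical.

```python
def hello():
    """Hypothetical worker function executed by each process in the group."""
    print("hello from a worker", flush=True)


def run_group(nproc=4):
    """Start `nproc` workers, wait for them to finish, then end the group."""
    # Imported here so this module still loads where Dragon is not installed.
    from dragon.native.process import ProcessTemplate  # assumed import path
    from dragon.native.process_group import ProcessGroup

    # restart=False: per the documentation, start() then moves to 'Running'.
    group = ProcessGroup(restart=False)
    group.add_process(nproc=nproc, template=ProcessTemplate(target=hello))
    group.init()   # set up the ProcessGroupState and Manager
    group.start()  # launch all processes from the templates
    puids = group.puids
    group.join()   # wait until all processes have completed
    group.stop()   # end of the group life-cycle
    return puids
```

Calling `run_group()` only makes sense inside a running Dragon runtime, which typically means launching the script with the Dragon launcher rather than the plain Python interpreter.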

property puids: list[int]

Return the puids of the processes contained in this group.

Returns

a list of puids

Return type

list[int]

property inactive_puids: List[Tuple[int, int]]

Return the puids and exit codes of the group’s processes that have exited.

Returns

a list of tuples (puid, exit_code)

Return type

List[Tuple[int, int]]

property exit_status: List[Tuple[int, int]]

Return the puids and exit codes of the group’s processes that have exited.

Returns

a list of tuples (puid, exit_code)

Return type

List[Tuple[int, int]]
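A list of (puid, exit_code) tuples like the one returned here is easy to post-process. The helper below is a hypothetical convenience, not part of the Dragon API, and the sample puids and exit codes are made up for illustration.

```python
def failed_processes(exit_status):
    """Return the (puid, exit_code) pairs whose exit code is nonzero."""
    return [(puid, code) for puid, code in exit_status if code != 0]


# Made-up data in the documented shape: a list of (puid, exit_code) tuples.
sample = [(4101, 0), (4102, 1), (4103, 0), (4104, -9)]
failures = failed_processes(sample)
print(failures)  # [(4102, 1), (4104, -9)]
```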

property status: str

Get the current status of the process group handled by this instance.

Returns

current status of the group

Return type

str
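Because the client learns about transitions by reading the status rather than by receiving a reply, a caller may want to poll it. The following sketch shows one way to do that with a timeout. `wait_for_status` and `FakeGroup` are hypothetical names, the helper works with any object exposing a `status` attribute, and the exact strings a real group reports are assumed to match the state names used in this document.

```python
import time


def wait_for_status(group, wanted, timeout=5.0, interval=0.05):
    """Poll `group.status` until it equals `wanted`; False if time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        if group.status == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeGroup:
    """Stand-in whose status flips from 'Running' to 'Idle' on the third read."""

    def __init__(self):
        self._reads = 0

    @property
    def status(self):
        self._reads += 1
        return "Idle" if self._reads >= 3 else "Running"


reached = wait_for_status(FakeGroup(), "Idle", timeout=1.0, interval=0.01)
print(reached)  # True
```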