dragon.native.process_group
The Dragon native class for managing the life-cycle of a group of Dragon processes.
The class is intended to be agnostic about what the processes are doing; it only maintains their life-cycle.
This file implements a client API class and a Manager process that handles all groups on the node. The manager holds a list of ProcessGroupState classes and receives signals from the client classes through a queue. The group of processes undergoes state transitions depending on the signals the client sends to the manager.
Due to transparency constraints, the manager cannot send information back to the client. Instead, shared state is used.
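The one-way communication pattern described above can be illustrated with a minimal, standard-library sketch. It is not Dragon's implementation: the names `manager`, `signals`, and `shared_state` are hypothetical, a thread stands in for the Manager process, and the signal-to-state mapping is invented purely for the demonstration.

```python
import queue
import threading


def manager(signals: queue.Queue, shared_state: dict, lock: threading.Lock) -> None:
    """Consume (group_id, signal) pairs and publish the resulting state."""
    # Hypothetical mapping, used only to make the sketch observable.
    next_state = {"start": "Running", "kill": "Idle"}
    while True:
        group_id, signal = signals.get()
        if signal == "shutdown":
            break
        # The manager never replies to the client directly;
        # it only updates the shared state.
        with lock:
            shared_state[group_id] = next_state.get(signal, "Error")


signals = queue.Queue()
shared_state = {}
lock = threading.Lock()

worker = threading.Thread(target=manager, args=(signals, shared_state, lock))
worker.start()

# Client side: send signals through the queue ...
signals.put(("group-0", "start"))
signals.put(("group-0", "kill"))
signals.put((None, "shutdown"))
worker.join()

# ... and learn the outcome by reading the shared state.
with lock:
    final_state = shared_state["group-0"]
print(final_state)
```

The client never blocks on a reply from the manager. It only enqueues signals and inspects the shared state afterwards.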
The underlying state machine looks like this: [state machine diagram not shown]
Classes
ProcessGroup | Robustly manage the lifecycle of a group of Dragon processes using Dragon Global Services.
- class ProcessGroup
Robustly manage the lifecycle of a group of Dragon processes using Dragon Global Services.
This is really a state machine of the underlying processes. We should always be able to “ask” the manager process for the state of the group and send it a signal to make a state transition.
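The transitions documented for `start`, `join`, `kill`, and `stop` in this class can be summarised as a toy model. This is a sketch under stated assumptions, not the manager's actual code: `State`, `GroupError`, `transition`, and the `"all_exited"` event (standing for "every process has completed") are hypothetical names. Any combination the documentation does not mention leaves the state unchanged here.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    MAINTAIN = auto()
    ERROR = auto()
    STOP = auto()


class GroupError(Exception):
    """Stand-in for DragonProcessGroupError in this toy model."""


def transition(state: State, signal: str, restart: bool = True) -> State:
    """Return the next state for a signal, following the documented rules."""
    if signal == "stop":
        return State.STOP  # documented as allowed from any state
    if signal == "start":
        return State.MAINTAIN if restart else State.RUNNING
    if signal == "join" and state is State.MAINTAIN:
        return State.RUNNING  # join stops maintaining the group
    if signal == "all_exited" and state is State.RUNNING:
        return State.IDLE  # assumed event: all processes completed
    if signal == "SIGKILL":
        return State.IDLE  # the group can then be reused
    if signal == "SIGTERM":
        if state is State.MAINTAIN:
            return State.RUNNING
        if state is State.ERROR:
            raise GroupError("SIGTERM sent to a group in the Error state")
    return state  # assumption: undocumented combinations are no-ops


state = transition(State.IDLE, "start", restart=True)
state = transition(state, "join")
state = transition(state, "all_exited")
print(state.name)
```

Keeping the model a pure function of `(state, signal)` makes each documented rule easy to check in isolation.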
- __init__(restart: bool = True, ignore_error_on_exit: bool = False, pmi_enabled: bool = False, walltime: Optional[float] = None, policy: Optional[Policy] = None)
Instantiate a number of Dragon processes.
- Parameters
restart (bool) – if True, restart worker processes that exit unexpectedly and suppress any errors from them, defaults to True.
ignore_error_on_exit (bool) – whether to ignore errors coming from processes when they exit from the Join state.
flags (Worker.Flags) – optional flags that affect the handling of a worker process.
pmi_enabled (bool) – instruct the runtime to set up the environment so that the binary can use MPI for inter-process communication.
walltime (float) – time in seconds until the processes in the group are killed.
policy (dragon.infrastructure.policy.Policy) – determines the placement of the processes
- add_process(nproc: int, template: ProcessTemplate) None
Add processes to the ProcessGroup.
- Parameters
template (dragon.native.process.ProcessTemplate) – a single process template, i.e. an unstarted process object
nproc (int) – number of Dragon processes to start that follow the provided template
- start() None
Start all processes according to the templates. If restart == False, transition to ‘Running’, otherwise transition to ‘Maintain’.
- join(timeout: Optional[float] = None, save_puids: bool = False) None
Wait for all processes to complete and the group to transition to Idle state. If the group status is ‘Maintain’, transition to ‘Running’.
Raises TimeoutError if the timeout expires.
- Parameters
timeout (float) – timeout in seconds, optional, defaults to None
- Returns
True if the timeout occurred, False otherwise
- Return type
bool
- kill(signal: Signals = Signals.SIGKILL, save_puids: bool = False) None
Send a signal to each process of the process group.
The signals SIGKILL and SIGTERM have the following side effects:
If the signal is SIGKILL, the group will transition to ‘Idle’. It can then be reused.
If the group status is ‘Maintain’, SIGTERM will transition it to ‘Running’.
If the group status is ‘Error’, SIGTERM will raise a DragonProcessGroupError.
- Parameters
signal (signal.Signals, optional) – the signal to send, defaults to signal.SIGKILL
- stop(save_puids=False) None
Forcibly terminate all workers by sending SIGKILL from any state, transition to Stop. This also removes the group from the manager process and marks the end of the group life-cycle.
- property inactive_puids: List[Tuple[int, int]]
Return the puids of the group’s processes that have exited, together with their exit codes.
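A full life-cycle, from construction to `stop`, might look like the following sketch. It relies only on the signatures documented above plus two assumptions that should be checked against the installed Dragon version: that `ProcessTemplate` accepts `target=` and `args=` keyword arguments, and that the script is launched through the Dragon runtime. The helper names `worker` and `run_group` are hypothetical.

```python
def worker(message: str) -> None:
    """Payload executed by every process in the group (hypothetical)."""
    print(f"worker says: {message}")


def run_group(nproc: int = 4) -> None:
    """Drive one ProcessGroup through start, join, and stop."""
    # Local imports: the sketch can be read and imported without Dragon
    # installed, but calling this function requires the Dragon runtime.
    from dragon.native.process import ProcessTemplate
    from dragon.native.process_group import ProcessGroup

    # ASSUMPTION: ProcessTemplate takes target= and args= keywords.
    template = ProcessTemplate(target=worker, args=("hello",))

    group = ProcessGroup(restart=False)  # do not restart exited workers
    group.add_process(nproc=nproc, template=template)
    group.start()  # restart == False, so the group transitions to 'Running'
    group.join()   # wait for all processes to complete ('Idle')
    group.stop()   # remove the group from the manager; end of life-cycle
```

Calling `run_group()` from a script started by the Dragon launcher would exercise the transitions Running, Idle, and Stop in that order. With `restart=True` the group would instead enter ‘Maintain’ after `start()`.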