MPI with Dragon
+++++++++++++++

Using the new `ProcessGroup` API, Dragon can now be used to start and manage a collection of PMI/MPI-based jobs. `ProcessGroup` uses the Global Services Client API Group interface, `GSGroup`, to manage the state of such processes/jobs. This functionality is currently only available on systems where Cray PALS is available. Before starting the PMI-enabled group, Dragon interacts with the job initialization hooks in the PMI library to establish a unique job ID, place the PMI applications, and set the appropriate job and rank parameters. Note that due to the nature of how PMI applications work, you cannot restart a failed PMI rank or add/remove ranks from a group of PMI processes. A minimal launch sketch is shown at the end of this section.

PMI
---

PMI is used by MPI to determine various parameters of a job, such as the number of nodes and ranks, the number of ranks per node, and so on. PMI gets the information to set these values from another library called PALS. The basic idea behind the PMI "plumbing" is to either bypass or hook PALS functions so that these values are set according to the user-specified job parameters. This can be done in a few ways:

* Slingshot requires a concept called a `PID` to create an endpoint. One PID is required per MPI rank. PID values range from 0 to 511, with the lowest values reserved by Dragon for the transport agents on a node. The remaining PIDs are allocated in contiguous intervals to all processes of a `ProcessGroup` colocated on a single node. Cray MPI is made aware of the base PID of an interval via the `MPICH_OFI_CXI_PID_BASE` environment variable, with the length of the interval determined by the number of MPI ranks on the node.

* The channels library, `libdragon.so`, contains the hooks for the PALS functions, and we use `LD_PRELOAD` to make sure it is linked before other libraries.

* The `_DRAGON_PALS_ENABLED` environment variable is used to enable or disable PMI support as needed. It is disabled in general, but must be enabled to launch MPI jobs via `ProcessGroup`.

* `FI_CXI_RX_MATCH_MODE=hybrid` is needed to prevent exhaustion of Slingshot NIC matching resources during periods of heavy network traffic, i.e., many-to-many communication patterns.

* PMI requires a unique value for its "control network", which is used to implement PMI's distributed pseudo-KVS. Global Services allocates a unique port for each MPI job and sets it via the `PMI_CONTROL_PORT` environment variable. Note that PMI control ports cannot be determined "locally" within each job, since different jobs can overlap on the same nodes and would otherwise choose conflicting ports.
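
To make the workflow concrete, below is a minimal sketch of launching an MPI binary through `ProcessGroup` with PMI support enabled. It assumes the `ProcessGroup` and `ProcessTemplate` interfaces from `dragon.native`, a `pmi_enabled` flag on the group, and a hypothetical prebuilt Cray MPI executable `./mpi_hello`; adjust names and signatures to match the Dragon release you are running.

.. code-block:: python

    # Minimal sketch: launching an MPI job with Dragon's ProcessGroup.
    # Assumes the dragon.native ProcessGroup/ProcessTemplate interfaces and a
    # pmi_enabled flag; exact names may differ between Dragon releases.
    import dragon                                   # must be imported before multiprocessing use
    from dragon.native.process import ProcessTemplate
    from dragon.native.process_group import ProcessGroup


    def main():
        num_ranks = 8                               # total MPI ranks to start
        exe = "./mpi_hello"                         # hypothetical MPI binary built against Cray MPI

        # pmi_enabled=True asks Dragon to set up the PMI/PALS environment for
        # every rank (job ID, PMI_CONTROL_PORT, MPICH_OFI_CXI_PID_BASE,
        # LD_PRELOAD of libdragon.so, and so on), as described above.
        grp = ProcessGroup(restart=False, pmi_enabled=True)

        # One template replicated num_ranks times; Global Services places the
        # ranks across the allocation and assigns contiguous PID intervals
        # on each node.
        grp.add_process(nproc=num_ranks,
                        template=ProcessTemplate(target=exe, args=(), cwd="."))

        grp.init()       # create the group through Global Services
        grp.start()      # start all ranks
        grp.join()       # wait for every rank to exit
        grp.stop()       # clean up group state


    if __name__ == "__main__":
        main()

Because a failed PMI rank cannot be restarted, the group is created with `restart=False`; if a rank dies, the whole job must be resubmitted rather than repaired in place. A script like this is typically run under the Dragon launcher (for example, `dragon script.py`) on a system where Cray PALS is available.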