MPI with Dragon
Using the new ProcessGroup API, Dragon can now be used to start and manage a collection of PMI/MPI-based jobs. ProcessGroup uses the Global Services Client API Group interface, GSGroup, to manage the state of such processes/jobs.
This functionality is currently available only on systems where Cray PALS is present. Before starting the PMI-enabled group, Dragon interacts with the job initialization hooks in the PMI library to establish a unique job ID, place the PMI applications, and set the appropriate job and rank parameters. Note that, due to the nature of how PMI applications work, you cannot restart a failed PMI rank or add/remove ranks from a group of PMI processes.
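As an example, a minimal sketch of launching an MPI executable through ProcessGroup might look like the following. The executable name ./mpi_hello is a placeholder, and keyword names such as pmi_enabled, add_process, and stop may differ slightly between Dragon releases:

    # Minimal sketch: launch an MPI binary as a PMI-enabled Dragon ProcessGroup.
    # "./mpi_hello" is a hypothetical MPI executable; keyword names may vary by release.
    import dragon                      # must be imported before multiprocessing
    import multiprocessing as mp

    from dragon.native.process import ProcessTemplate
    from dragon.native.process_group import ProcessGroup


    def main():
        mp.set_start_method("dragon")

        num_ranks = 8
        template = ProcessTemplate(target="./mpi_hello", args=(), cwd=".")

        # pmi_enabled=True asks Dragon to set up the PMI/PALS parameters
        # described in the next section before the ranks start.
        grp = ProcessGroup(restart=False, pmi_enabled=True)
        grp.add_process(nproc=num_ranks, template=template)

        grp.init()
        grp.start()   # all ranks come up as a single PMI job
        grp.join()    # wait for the MPI job to finish
        grp.stop()


    if __name__ == "__main__":
        main()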
PMI
PMI is used by MPI to determine various parameters of a job, such as the number of nodes and ranks, the number of ranks per node, and so on. PMI gets the information to set these values from another library called PALS. The basic idea behind the PMI “plumbing” is to either bypass or hook PALS functions so that these values are set according to the user-specified job parameters. This is done in a few ways, described below.
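All of this is intended to be transparent to the MPI application itself: a rank launched through a PMI-enabled ProcessGroup simply sees the usual job parameters. A minimal sketch of such an application, assuming mpi4py is installed, might be:

    # mpi_hello.py -- trivial MPI program printing the parameters established via PMI.
    # Assumes mpi4py is installed; this is the kind of executable a PMI-enabled
    # ProcessGroup would launch for each rank.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()           # this rank's index within the job
    size = comm.Get_size()           # total number of ranks in the job
    node = MPI.Get_processor_name()  # node this rank landed on

    print(f"rank {rank} of {size} running on {node}")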
Slingshot requires a concept called a PID to create an endpoint, and one PID is required per MPI rank. PID values range from 0 to 511, with the lowest values reserved by Dragon for the transport agents on a node. The remaining PIDs are allocated in contiguous intervals for all processes of a ProcessGroup co-located on a single node. Cray MPI is made aware of the base PID of an interval by setting MPICH_OFI_CXI_PID_BASE=<base PID>, with the length of the interval determined by the number of MPI ranks on the node.

The channels library, libdragon.so, contains the hooks for the PALS functions, and LD_PRELOAD is used to make sure it is linked before other libraries.

The _DRAGON_PALS_ENABLED environment variable is used to enable or disable PMI support as needed. It is disabled in general, but must be enabled to launch MPI jobs via ProcessGroup.

Setting FI_CXI_RX_MATCH_MODE=hybrid is needed to prevent exhaustion of Slingshot NIC matching resources during periods of heavy network traffic, i.e., many-to-many communication patterns.

PMI requires a unique value for its “control network”, which is used to implement PMI's distributed pseudo-KVS. Global Services allocates a unique port for each MPI job and sets it via the PMI_CONTROL_PORT environment variable. Note that PMI control ports cannot be determined “locally” within each job, since different jobs can overlap on the same nodes.
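Putting these pieces together, the per-node environment Dragon assembles for the ranks of a ProcessGroup might look roughly like the following sketch. Dragon computes and sets these values internally; the values shown here are purely illustrative and are not meant to be set by hand:

    # Illustrative only: roughly the extra environment Dragon provides to the
    # MPI ranks co-located on one node. All values below are made up.
    import os

    rank_env = dict(os.environ)
    rank_env.update({
        "LD_PRELOAD": "libdragon.so",        # PALS hooks must be linked first
        "_DRAGON_PALS_ENABLED": "1",         # enable Dragon's PMI support
        "MPICH_OFI_CXI_PID_BASE": "2",       # base Slingshot PID of this node's interval
        "FI_CXI_RX_MATCH_MODE": "hybrid",    # avoid exhausting NIC matching resources
        "PMI_CONTROL_PORT": "34567",         # unique control port chosen by Global Services
    })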