Multi Node Deployment
This section outlines the bring-up and teardown procedures for multi-node systems.
Multi-node launches start from a frontend node that the user is assumed to have a command prompt on. This frontend can be co-located with backend nodes or run from its own node. All off-node communication initiated by Local Services goes through Global Services. Local Services itself has a one-node view of the world, while Global Services does the work of communicating off-node when necessary. Fig. 46 and Fig. 47 depict a multi-node version of the Dragon Services.
Multi Node Bringup
Network Configuration
The launcher frontend must know what resources are available for its use on the
compute backend. To obtain that information for a given set of workload
managers, there is a network config tool in the launcher module. This tool is
exposed for general use. However, if deploying Dragon on a supported workload
manager with an active job allocation, the launcher frontend will handle
obtaining the network configuration. It exists in dragon.launcher.network_config.main,
but can be invoked directly via dragon-network-config. Its help output is below:
usage: dragon-network-config [-h] [-p PORT] [--network-prefix NETWORK_PREFIX] [--wlm WORKLOAD_MANAGER] [--log] [--output-to-yaml] [--output-to-json]
[--no-stdout] [--primary PRIMARY] [--hostlist HOSTLIST | --hostfile HOSTFILE]
Runs Dragon internal tool for generating network topology
optional arguments:
-h, --help show this help message and exit
-p PORT, --port PORT Infrastructure listening port (default: 6565)
--network-prefix NETWORK_PREFIX
NETWORK_PREFIX specifies the network prefix the dragon runtime will use to determine which IP addresses it should use to build
multinode connections from. By default the regular expression r'^(hsn|ipogif|ib)\d+$' is used -- the prefix for known HPE-Cray XC
and EX high speed networks. If uncertain which networks are available, the following will return them in pretty formatting: `dragon-
network-ifaddrs --ip --no-loopback --up --running | jq`. Prepending with `srun` may be necessary to get networks available on
backend compute nodes
--wlm WORKLOAD_MANAGER, -w WORKLOAD_MANAGER
Specify what workload manager is used. Currently supported WLMs are: slurm, pbs+pals, ssh
--log, -l Enable debug logging
--output-to-yaml, -y Output configuration to YAML file
--output-to-json, -j Output configuration to JSON file
--no-stdout Do not print the configuration to stdout
--primary PRIMARY Specify the hostname to be used for the primary compute node
--hostlist HOSTLIST Specify backend hostnames as a comma-separated list, eg: `--hostlist host_1,host_2,host_3`. `--hostfile` or `--hostlist` is a
required argument for WLM SSH and is only used for SSH
--hostfile HOSTFILE Specify a list of hostnames to connect to via SSH launch. The file should be a newline character separated list of hostnames.
`--hostfile` or `--hostlist` is a required argument for WLM SSH and is only used for SSH
# To create YAML and JSON files with a slurm WLM:
$ dragon-network-config --wlm slurm --output-to-yaml --output-to-json
If output to a file (YAML and JSON are supported), the file can be provided to the launcher frontend at launch. The formatting of the files appears below:
'0':
  h_uid: null
  host_id: 18446744071562724608
  ip_addrs:
  - 10.128.0.5:6565
  is_primary: true
  name: nid00004
  num_cpus: 0
  physical_mem: 0
  shep_cd: ''
  state: 4
'1':
  h_uid: null
  host_id: 18446744071562724864
  ip_addrs:
  - 10.128.0.6:6565
  is_primary: false
  name: nid00005
  num_cpus: 0
  physical_mem: 0
  shep_cd: ''
  state: 4
{
  "0": {
    "state": 4,
    "h_uid": null,
    "name": "nid00004",
    "is_primary": true,
    "ip_addrs": [
      "10.128.0.5:6565"
    ],
    "host_id": 18446744071562724608,
    "num_cpus": 0,
    "physical_mem": 0,
    "shep_cd": ""
  },
  "1": {
    "state": 4,
    "h_uid": null,
    "name": "nid00005",
    "is_primary": false,
    "ip_addrs": [
      "10.128.0.6:6565"
    ],
    "host_id": 18446744071562724864,
    "num_cpus": 0,
    "physical_mem": 0,
    "shep_cd": ""
  }
}
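A frontend can consume a configuration file like the JSON above with only the standard library. The following sketch parses the two-node example and picks out the primary node (the one that will host Global Services); the data is copied from the example, not generated by the tool here.

```python
import json

# The JSON network configuration shown above (two backend nodes).
config_text = """
{
  "0": {"state": 4, "h_uid": null, "name": "nid00004", "is_primary": true,
        "ip_addrs": ["10.128.0.5:6565"], "host_id": 18446744071562724608,
        "num_cpus": 0, "physical_mem": 0, "shep_cd": ""},
  "1": {"state": 4, "h_uid": null, "name": "nid00005", "is_primary": false,
        "ip_addrs": ["10.128.0.6:6565"], "host_id": 18446744071562724864,
        "num_cpus": 0, "physical_mem": 0, "shep_cd": ""}
}
"""

config = json.loads(config_text)

# Exactly one entry has is_primary set; that node hosts Global Services.
primary = next(node for node in config.values() if node["is_primary"])
print(primary["name"])  # nid00004
print(len(config))      # 2
```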
Launching
The Local Services and Global Services are instantiated when a program is launched by the Dragon Launcher. There is one Local Services instance on each node and one Global Services instance in the entire job. Since all Services run as user-level services (i.e. not with superuser authority), the services described here are assumed to be one per launched user program.
The multi-node bring-up sequence is given in startup-seq-multinode and in the
section titled Multi Node Bringup, where the message descriptions are also
provided. The Launcher Frontend brings up an instance of the Launcher Backend
on each node. Each launcher component (frontend and backend) then brings up an
instance of the TCP Transport Agent, which serves to create an Overlay Tree for
communicating infrastructure-related messages.
The Launcher Backend then brings up Local Services. The Backend forwards messages from the Launcher Frontend to Local Services. Local Services forwards output from the user program to the Frontend through the Backend.
Sequence diagram
The diagram below depicts the message flow in the multi-node startup sequence.
Sequence diagram of Dragon multi-node bringup
Notes on Bring-up Sequence
- Get Network Configuration (A1-A3)
Launch one instance of the network config tool on each backend compute node. Each instance will use other dragon tools to determine a preferred IP address for the frontend and its unique Host ID, which will be used in routing messages to and from it.
Each network config instance will line buffer the network information to stdout as a serialized JSON structure. The frontend will consume this information.
The config tool will then exit.
- Workload Manager Launch (Bring-up M1, M5 | Teardown M24)
Launch one instance of the given script/executable on each backend. No CPU affinity binding should be used (eg: `srun -n <# nodes> -N <# nodes> --cpu_bind=none python3 backend.py`).
The exit code from the WLM should be monitored for errors.
- Launch frontend overlay agent (A3-A4)
Launch the TCP transport agent. Over command-line arguments, provide channels for communication to and from the frontend (appearing collectively as CTRL ch in the sequence diagrams), IP addresses of all backend compute nodes, and host IDs of all backend compute nodes. NOTE: The Frontend and Backend each manage their own memory pool, independent of Local Services, specifically for managing the Channels necessary for Overlay Network communication.
In the environment, there will be environment variables set to tell the overlay network what gateway channels it should monitor for remote messaging.
Once the agent is up, it sends a ping saying as much.
- Overlay input arguments (M3, M6):
- start_overlay_network(ch_in_sdesc: B64, ch_out_sdesc: B64, log_sdesc: B64, host_ids: list[str], ip_addrs: list[str], frontend: bool = False, env: dict | None = None)
Entry point to start the launcher overlay network
- Parameters:
ch_in_sdesc (B64) – Channel to be used for infrastructure messages incoming from launcher frontend/backend. One component of “CTRL ch”
ch_out_sdesc (B64) – Channel to be used for infrastructure messages outgoing to launcher frontend/backend. One component of “CTRL ch”
log_sdesc (B64) – Channel to be used for sending logging messages to the Frontend
host_ids (list[str]) – List of host IDs for all nodes that will be communicated with
ip_addrs (list[str]) – List of IP address for all nodes that will be communicated with
frontend (bool, optional) – Whether the parent process is the launcher frontend, defaults to False
env (dict, optional) – environment variables. Should contain gateway channels, defaults to None
- Returns:
Popen object for the overlay network process
- Return type:
Popen
- Launch backend and its overlay network (A4-M8)
Use workload manager to launch 1 instance of backend on each compute node. In command line arguments, provide a channel to communicate with frontend, the frontend’s IP address, and its host ID.
The Backend will provide the frontend channel, IP, and host ID to start its own TCP overlay agent as in Launch frontend overlay agent (A3-A4). Using the provided information via the launch command and its own locally obtained IP address and host ID, each backend instance will be able to communicate directly with the single frontend instance and vice-versa.
Once the backend and its overlay agent are up, each backend sends a BEIsUp message to the frontend. The frontend waits until it receives this message from every backend before continuing.
- Provide Node Index to each Backend instance (M9)
The Frontend will use host ID information in the BEIsUp message to assign node indices. The primary node is given node index 0 and can be specified by hostname by the user when starting the Dragon runtime. The remaining values will be integers up to N-1, where N is the number of backend compute nodes. The node index is given in FENodeIdxBE. It travels from the frontend over the channel whose descriptor was provided in the BEIsUp message by the backend.
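The assignment rule above (primary first, the rest filling out 0 through N-1) can be sketched as follows; the host IDs here are hypothetical placeholders, not values the runtime would produce.

```python
def assign_node_indices(host_ids, primary_host_id):
    """Assign node index 0 to the primary node and 1..N-1 to the rest,
    mirroring the FENodeIdxBE assignment rule described above."""
    ordered = [primary_host_id] + [h for h in host_ids if h != primary_host_id]
    return {host_id: idx for idx, host_id in enumerate(ordered)}

# Hypothetical host IDs gathered from BEIsUp messages; hostA was named
# by the user as the primary node.
indices = assign_node_indices(["hostB", "hostA", "hostC"], primary_host_id="hostA")
print(indices)  # {'hostA': 0, 'hostB': 1, 'hostC': 2}
```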
- Start Local Services (A12-A15)
This is the most critical part of bring-up. If it is successful, everything else will likely be fine.
The Launcher Backend popens Local Services. Over stdin, it provides BENodeIdxSH. That message contains the logging channel descriptor, hostname, primary node status (boolean), node index, and IP address for backend communication. This is a potentially different IP address than the one being used for the overlay network. If it is the same IP, it IS a different port, as it will be used by a different transport agent for communication among other backend nodes, NOT the frontend.
SHPingBE is returned to the backend over stdout. This message contains the channel descriptors that will be used for all future communication on the node.
The backend sends BEPingSH to confirm channel comms are successful.
SHChannelsUp is returned by Local Services. It contains channel descriptors for all service communication for its node, including Global Services (for the primary node), which every other node will use to communicate with the Global Services instance on the primary node. The Host ID is also contained in SHChannelsUp.
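The stdin/stdout handshake above can be sketched with a toy parent/child pair. The `_tc` and payload fields here are illustrative stand-ins, not the actual Dragon wire format, and the child is a stub playing the role of Local Services.

```python
import json
import subprocess
import sys

# Stub child standing in for Local Services: it reads one message on stdin
# and replies on stdout, like the BENodeIdxSH / SHPingBE exchange above.
child_src = (
    "import json, sys\n"
    "msg = json.loads(sys.stdin.readline())\n"
    "reply = {'_tc': 'SHPingBE', 'node_idx': msg['node_idx']}\n"
    "print(json.dumps(reply), flush=True)\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_src],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

# Parent stands in for the Launcher Backend sending BENodeIdxSH over stdin.
be_node_idx = {"_tc": "BENodeIdxSH", "node_idx": 3, "is_primary": False}
proc.stdin.write(json.dumps(be_node_idx) + "\n")
proc.stdin.flush()

ping = json.loads(proc.stdout.readline())
proc.stdin.close()
proc.wait()
print(ping)  # {'_tc': 'SHPingBE', 'node_idx': 3}
```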
- Many-to-One SHChannelsUp, TAUp, TAHalted, SHHaltBE (Bring-up M14, M21 | Teardown M15, M19):
These messages are a gather operation from all the backend nodes to the frontend. They represent a potential hang situation that requires care to eliminate the likelihood of a hung bring-up sequence.
- One-to-Many LAChannelsInfo, SHHaltTA, (Bring-up M15 | Teardown M10, M16):
These messages and their communication (one-to-many Bcast) represent one of the biggest bottlenecks in the bring-up sequence. For larger messages, the first step in reducing that cost is compressing them. The next would be to implement a tree Bcast.
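One way to realize the suggested tree Bcast is a k-ary fanout rooted at the frontend, which bounds the broadcast depth at O(log N) instead of the frontend contacting all N backends itself. The rank numbering and fanout factor below are illustrative assumptions, not the runtime's actual scheme.

```python
def kary_children(rank, n, k=2):
    """Ranks that `rank` forwards a broadcast message to in a k-ary tree
    rooted at rank 0 (the frontend). Each backend relays onward."""
    return [c for c in range(k * rank + 1, k * rank + k + 1) if c < n]

# With 7 nodes: the root fans out to 1 and 2, which relay further down.
print(kary_children(0, 7))  # [1, 2]
print(kary_children(1, 7))  # [3, 4]
print(kary_children(3, 7))  # []  (leaf)
```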
- Transmit LAChannelsInfo (A15-A16)
This is the most critical message. It contains all information necessary to make execution of the Dragon runtime possible.
The Frontend receives one SHChannelsUp from each backend. It aggregates these into LAChannelsInfo. This is potentially a very large message, as it scales linearly with the number of nodes. It is beneficial to compress this message before transmitting.
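A sketch of the suggested compression, using zlib with a base64 wrapping for channel-safe transport. The payload below is a hypothetical stand-in for LAChannelsInfo (one descriptor entry per node, so size grows linearly with node count); the field names are illustrative.

```python
import base64
import json
import zlib

# Hypothetical stand-in for an LAChannelsInfo payload for 512 nodes.
la_channels_info = {
    str(i): {"host_id": i, "shep_cd": "A" * 64, "ip_addrs": [f"10.128.0.{i}:6565"]}
    for i in range(512)
}

raw = json.dumps(la_channels_info).encode()
compressed = base64.b64encode(zlib.compress(raw, level=9))
print(len(raw), len(compressed))  # compressed is much smaller

# Receiving side reverses the transform before parsing.
restored = json.loads(zlib.decompress(base64.b64decode(compressed)))
assert restored == la_channels_info
```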
- Start Transport Agent (A16-A18)
Local Services starts the Transport Agent as a child process. Over the command line, it passes the node index of the node, a channel descriptor to receive messages from Local Services (TA ch), and a logging channel descriptor. In the environment are variables containing channel descriptors for gateway channels, to be used for sending and receiving messages off-node.
Over TA ch, LAChannelsInfo is provided, so the transport agent can take IP and host ID info to manage communication to other nodes on the backend.
The transport agent communicates a TAPingSH to Local Services over the Local Services channel provided for its node index in the LAChannelsInfo message. This is sent upstream as a TAUp message. The frontend waits for such a message from every backend.
- CLI Inputs to Start Transport Agent (M17)
- resolve_args(args=None)
- start_transport_agent(node_index: int | str, in_ch_sdesc: B64, log_ch_sdesc: B64 | None = None, args: list[str] | None = None, env: dict[str, str] | None = None, gateway_channels: Iterable[Channel] | None = None, infrastructure: bool = False) → Popen
Start the backend transport agent
- Parameters:
node_index (Union[int, str]) – node index for this agent. Should be in [0, N-1], where N is the number of backend compute nodes
in_ch_sdesc (B64) – Channel to be used for incoming infrastructure messages
log_ch_sdesc (B64, optional) – Channel to be used for logging to the Frontend, defaults to None
args (list[str], optional) – Command args for starting Transport Agents, defaults to None in which case the TCP transport agent is initiated as default
env (dict[str, str], optional) – Environment variables, defaults to None
gateway_channels (Iterable[Channel], optional) – gateway channels to be set in transport agent’s environment. If not set, it is assumed they are already set in the parent environment or the one input, defaults to None
infrastructure (bool, optional) – whether this agent will be used for infrastructure messaging, defaults to False
- Returns:
Transport agent child process
- Return type:
Popen
- Start Global Services (A18-A22)
After sending TAUp, Local Services on the primary node starts Global Services. Over stdin, Local Services provides LAChannelsInfo. With this information, Global Services is immediately able to ping all Local Services instances on the backend via GSPingSH (note: the Transport Agent is necessary to route those correctly).
After receiving a response from every Local Services (SHPingGS), it sends a GSIsUp that is ultimately routed to the Frontend. It is important to note the GSIsUp message is not guaranteed to arrive after all TAUp messages, ie: GSIsUp will necessarily come after TAUp on the primary node but may come before the TAUp message from any other backend node.
Once all TAUp and GSIsUp messages are received, the Dragon runtime is fully up.
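Because of the ordering caveat above, the frontend must count TAUp and GSIsUp messages independently rather than expecting a fixed arrival order. A minimal sketch of that gather loop, driven by a simulated message queue:

```python
import queue

def wait_for_up(msgs, num_nodes, timeout=10.0):
    """Block until one TAUp per backend node and a single GSIsUp arrive,
    in any order. Raises queue.Empty if the runtime stalls."""
    ta_up = 0
    gs_up = False
    while ta_up < num_nodes or not gs_up:
        msg = msgs.get(timeout=timeout)
        if msg == "TAUp":
            ta_up += 1
        elif msg == "GSIsUp":
            gs_up = True
    return ta_up, gs_up

q = queue.Queue()
# GSIsUp from the primary node may overtake TAUp from other nodes.
for m in ["TAUp", "GSIsUp", "TAUp", "TAUp"]:
    q.put(m)
result = wait_for_up(q, num_nodes=3)
print(result)  # (3, True)
```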
- Start User Program (A22-A25)
The Frontend starts the user program. This program was provided with the launcher's invocation. It is sent to the primary node via a GSProcessCreate message. Upon receipt, Global Services sends a GSProcessCreateResponse.
After sending the response, it selects a Local Services instance to deploy the program to. After selection, it sends this request as a SHProcessCreate to the selected Local Services. This Local Services sends a SHProcessCreateResponse as receipt.
- Route User Program stdout (A25-A26)
Local Services ingests all the User Program stdout and stderr. This is propagated to the Frontend as a SHFwdOutput message that contains information about what process the strings originated from. With this information, the Frontend can provide the output via its stdout so the user can see it buffered cleanly and correctly with varying levels of verbosity.
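A sketch of how a frontend might render such forwarded output. The field names (fd, hostname, puid, data) are illustrative stand-ins for the information SHFwdOutput carries, not the actual wire format.

```python
import sys

def handle_fwd_output(msg, out=sys.stdout, err=sys.stderr, verbose=False):
    """Render one SHFwdOutput-style message on the frontend, routing to
    stdout or stderr and optionally prefixing the originating process."""
    stream = out if msg["fd"] == "stdout" else err
    prefix = f"[{msg['hostname']} puid={msg['puid']}] " if verbose else ""
    for line in msg["data"].splitlines():
        stream.write(prefix + line + "\n")

msg = {"fd": "stdout", "hostname": "nid00004", "puid": 4294967297,
       "data": "hello from the user program\n"}
handle_fwd_output(msg, verbose=True)
# [nid00004 puid=4294967297] hello from the user program
```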
Multi Node Teardown
In the multi-node case, there are a few more messages that are processed by
Local Services than in the single-node case. The GSHalted
message is forwarded
by Local Services to the launcher backend. The SHHaltTA
is sent to Local Services,
and it forwards the TAHalted
message to the backend when received.
Other than these three additional messages, the teardown is identical to the
single-node version, except that the process is repeated on every one of the
compute nodes.
In an abnormal situation, the AbnormalTermination
message may be received by
the Launcher from either Local Services or Global Services (via the Backend). In
that case, the launcher will initiate a teardown of the infrastructure starting
with sending of GSTeardown
(message 5 in diagram below).
Sequence diagram
Multi-Node Teardown Sequence
Notes on Teardown Sequence
- Head Proc Exit (A1-A4)
Local Services monitors its managed processes via waitpid. Once it registers an exit, it matches its pid to the internally tracked and globally unique puid. This puid is transmitted via SHProcessExit to Global Services.
Global Services cleans up any resources tied to tracking that specific head process. Once that is complete, it alerts the Frontend via a GSHeadExit.
- Local Services Death Watcher (M1):
Local Services’ main thread spawns a thread that repeatedly calls waitpid. This thread will receive the exit code from the head process.
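The core of that death-watcher loop — reap a child, map its pid back to the tracked puid, and recover the exit code — can be sketched as below. The puid value is a hypothetical placeholder; the real mapping is maintained internally by Local Services. (POSIX only, since it relies on os.waitpid semantics.)

```python
import os
import subprocess
import sys

# pid -> puid map, as Local Services tracks for its managed processes.
puid_by_pid = {}

# Launch a short-lived stand-in for the head process.
child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(0)"])
puid_by_pid[child.pid] = 4294967298   # hypothetical globally unique puid

# The death-watcher's core step: reap the child and map it to its puid.
pid, status = os.waitpid(child.pid, 0)
puid = puid_by_pid.pop(pid)
exit_code = os.waitstatus_to_exitcode(status)
print(puid, exit_code)  # 4294967298 0
```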
- Begin Teardown (A5-A7)
With exit of the head process, as of this writing (03/24/2023), the Launcher will begin teardown.
Teardown is initiated via the GSTeardown message to Global Services. Once this message is sent, all user input is ignored; teardown is the only work being done. GSTeardown is also the message sent via a SIGINT signal from the user (assuming the full runtime is up).
Consequently, with a full runtime, GSTeardown is always the first point of entry for exiting the Dragon runtime, no matter the state of the various individual runtime services.
- Halt Global Services (A7-A9)
Once Global Services receives the GSTeardown, it detaches from any channels. It does not destroy any (aside from any it may have directly created) because Local Services created the channels used for infrastructure and manages the memory created for them.
Once it has shut down its resources, it transmits GSHalted to Local Services over stdout and exits. This is forwarded all the way to the Frontend.
- Halt Transport Agent (A9-A12)
After the Frontend receives GSHalted, it initiates teardown of all Backend Transport Agents via SHHaltTA, which is eventually routed to the transport agent from Local Services.
Upon receipt, the agent cancels any outstanding work it has, sends TAHalted over its channel to Local Services (NOTE: This should be over stdout) and exits.
Local Services forwards the exit to the Frontend via routing of a TAHalted message. The Frontend waits for receipt of N messages before continuing (NOTE: This can easily result in hung processes and should be addressed).
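One way to address the hang noted above is to bound the gather with a timeout, proceeding with teardown even if some backends never report. A sketch with a simulated message queue (the timeout value is an arbitrary assumption):

```python
import queue

def gather_halted(msgs, expected, timeout=5.0):
    """Collect up to `expected` TAHalted messages, giving up after
    `timeout` seconds of silence instead of hanging forever."""
    received = 0
    while received < expected:
        try:
            msgs.get(timeout=timeout)
        except queue.Empty:
            break                      # proceed with teardown anyway
        received += 1
    return received

q = queue.Queue()
q.put("TAHalted")
q.put("TAHalted")
# A third backend never responds; the gather times out instead of hanging.
result = gather_halted(q, expected=3, timeout=0.1)
print(result)  # 2
```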
- Issue SHTeardown (A13-A16)
This is the first step in the more carefully orchestrated destruction of the Backend. Once Local Services receives SHTeardown, it detaches from its logging channel and transmits a SHHaltBE to the Backend.
After destroying its logging infrastructure, the Backend forwards the SHHaltBE to the Frontend. The Frontend waits until it receives N SHHaltBE messages before exiting. (NOTE: Another potential mess).
- Issue BEHalted (A17-A19)
The Frontend sends BEHalted to all Backends. Once sent, there is no more direct communication between Backend and Frontend. The Frontend will simply confirm completion via exit codes from the workload manager.
Once Local Services receives its BEHalted, it deallocates all its allocated memory and exits.
- Shutdown Backend and its overlay (A20-A25)
After transmitting BEHalted to Local Services, the Backend issues SHHaltTA to its overlay transport agent. This triggers a teardown identical to the one referenced in Halt Transport Agent (A9-A12).
Once the overlay is down and Local Services is down, as measured by a simple wait on child process exits, the Backend deallocates its managed memory, including logging infrastructure, and exits.
- Shutdown Frontend (A26-A30)
After the workload manager returns the exit code for the Backend exit, the Frontend shuts down its overlay similar to the Backend in Shutdown Backend and its overlay (A20-A25). After waiting on the successful overlay exit, it similarly deallocates managed memory and exits.
After the Frontend exits, the Dragon runtime is down.