Multi Node Deployment

This section outlines the bringup and teardown procedures for multi-node systems.

Multi-node launches start from a frontend node on which the user is assumed to have a command prompt. This frontend can be co-located with backend nodes or run on its own node. All off-node communication initiated by Local Services goes through Global Services. Local Services itself has a one-node view of the world, while Global Services does the work of communicating off-node when necessary. Fig. 44 and Fig. 45 depict a multi-node version of the Dragon Services.

[Image: deployment_multi_node.svg]

Fig. 44 Startup Overview

[Image: multinodeoverview.png]

Fig. 45 Multi-Node Overview of Dragon Services

Multi Node Bringup

Network Configuration

The launcher frontend must know what resources are available for its use on the compute backend. To obtain that information for a given set of workload managers, there is a network config tool in the launcher module. This tool is exposed for general use. However, if deploying Dragon on a supported workload manager with an active job allocation, the launcher frontend will handle obtaining the network configuration itself. The tool lives in dragon.launcher.network_config.main but can be invoked directly via dragon-network-config. Its help is below:

Listing 43 Dragon Network Config (dragon-network-config) tool’s help and basic use
usage: dragon-network-config [-h] [-p PORT] [--network-prefix NETWORK_PREFIX] [--wlm WORKLOAD_MANAGER] [--log] [--output-to-yaml] [--output-to-json]
                             [--no-stdout] [--primary PRIMARY] [--hostlist HOSTLIST | --hostfile HOSTFILE]

Runs Dragon internal tool for generating network topology

optional arguments:
  -h, --help            show this help message and exit
  -p PORT, --port PORT  Infrastructure listening port (default: 6565)
  --network-prefix NETWORK_PREFIX
                        NETWORK_PREFIX specifies the network prefix the dragon runtime will use to determine which IP addresses it should use to build
                        multinode connections from. By default the regular expression r'^(hsn|ipogif|ib)\d+$' is used -- the prefix for known HPE-Cray XC
                        and EX high speed networks. If uncertain which networks are available, the following will return them in pretty formatting:
                        `dragon-network-ifaddrs --ip --no-loopback --up --running | jq`. Prepending with `srun` may be necessary to get networks
                        available on backend compute nodes
  --wlm WORKLOAD_MANAGER, -w WORKLOAD_MANAGER
                        Specify what workload manager is used. Currently supported WLMs are: slurm, pbs+pals, ssh
  --log, -l             Enable debug logging
  --output-to-yaml, -y  Output configuration to YAML file
  --output-to-json, -j  Output configuration to JSON file
  --no-stdout           Do not print the configuration to stdout
  --primary PRIMARY     Specify the hostname to be used for the primary compute node
  --hostlist HOSTLIST   Specify backend hostnames as a comma-separated list, eg: `--hostlist host_1,host_2,host_3`. `--hostfile` or `--hostlist` is a
                        required argument for WLM SSH and is only used for SSH
  --hostfile HOSTFILE   Specify a list of hostnames to connect to via SSH launch. The file should be a newline character separated list of hostnames.
                        `--hostfile` or `--hostlist` is a required argument for WLM SSH and is only used for SSH

# To create YAML and JSON files with a slurm WLM:
$ dragon-network-config --wlm slurm --output-to-yaml --output-to-json

If output to a file (YAML and JSON are supported), the file can be provided to the launcher frontend at launch. The formatting of the files appears below:

Listing 44 Example of YAML formatted network configuration file
'0':
  h_uid: null
  host_id: 18446744071562724608
  ip_addrs:
  - 10.128.0.5:6565
  is_primary: true
  name: nid00004
  num_cpus: 0
  physical_mem: 0
  shep_cd: ''
  state: 4
'1':
  h_uid: null
  host_id: 18446744071562724864
  ip_addrs:
  - 10.128.0.6:6565
  is_primary: false
  name: nid00005
  num_cpus: 0
  physical_mem: 0
  shep_cd: ''
  state: 4
Listing 45 Example of JSON formatted network configuration file
{
  "0": {
    "state": 4,
    "h_uid": null,
    "name": "nid00004",
    "is_primary": true,
    "ip_addrs": [
      "10.128.0.5:6565"
    ],
    "host_id": 18446744071562724608,
    "num_cpus": 0,
    "physical_mem": 0,
    "shep_cd": ""
  },
  "1": {
    "state": 4,
    "h_uid": null,
    "name": "nid00005",
    "is_primary": false,
    "ip_addrs": [
      "10.128.0.6:6565"
    ],
    "host_id": 18446744071562724864,
    "num_cpus": 0,
    "physical_mem": 0,
    "shep_cd": ""
  }
}
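
Either file is straightforward to consume programmatically. Below is a minimal sketch in Python, assuming the JSON file above was saved as network_config.json (the filename is an assumption):

import json

# Load a JSON network configuration produced by dragon-network-config.
with open("network_config.json") as f:
    config = json.load(f)

# Each top-level key is a node index; each value is a node descriptor.
for node_index, node in config.items():
    role = "primary" if node["is_primary"] else "worker"
    print(f"node {node_index}: {node['name']} ({role}) at {node['ip_addrs'][0]}")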

Launching

The Local Services and Global Services are instantiated when a program is launched by the Dragon Launcher. There is one Local Services instance on each node and one Global Services instance in the entire job. Since all Services run as user-level services (i.e. not with superuser authority), the services described here are assumed to be one per launched user program.

The multi-node bring-up sequence is shown in the sequence diagram below and detailed in the section titled Multi Node Bringup, where the message descriptions are also provided. The Launcher Frontend brings up an instance of the Launcher Backend on each node. Each launcher (frontend and backend) then brings up an instance of the TCP Transport Agent, which serves to create an Overlay Tree for communicating infrastructure-related messages.

The Launcher Backend then brings up Local Services. The Backend forwards messages from the Launcher Frontend to Local Services. Local Services forwards output from the user program to the Frontend through the Backend.

Sequence diagram

The diagram below depicts the message flow in the multi-node startup sequence.

[Sequence diagram: multi-node bringup message flow (messages M1-M36, actions A1-A21) among the Launcher Frontend, TCP Overlay Front End, Launcher Backend, TCP Overlay Backend, Local Services, Transport Agent, Global Services, and the User Program. The individual messages and actions are described in the notes below.]

Sequence diagram of Dragon multi-node bringup

Notes on Bring-up Sequence

Get Network Configuration (A1-A3)

Launch one instance of the network config tool on each backend compute node. Each instance will use other Dragon tools to determine a preferred IP address to report to the frontend, along with the node's unique Host ID that will be used in routing messages to and from it.

Each network config instance will line buffer the network information to stdout as a serialized JSON structure. The frontend will consume this information (see the sketch below).

The config tool will then exit.
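
For illustration, the frontend's consumption of that stream might look like the following minimal sketch; the record field names are assumed to match the file format shown in Listing 44 and Listing 45:

import json
import sys

# Hypothetical sketch: each config-tool instance writes one JSON record per
# line; the frontend consumes the stream as records arrive.
nodes = {}
for line in sys.stdin:
    record = json.loads(line)
    nodes[record["host_id"]] = record  # keyed by the node's unique Host ID

print(f"discovered {len(nodes)} backend node(s)", file=sys.stderr)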

Workload Manager Launch (Bring-up M1, M5 | Teardown M24)

Launch one instance of the given script/executable on each backend. No CPU affinity binding should be used (e.g., `srun -n <# nodes> -N <# nodes> --cpu_bind=none python3 backend.py`).

The exit code from the WLM should be monitored for errors.

Launch frontend overlay agent (A3-A4)

Launch the TCP transport agent. Via command-line arguments, provide channels for communication to and from the frontend (appearing collectively as CTRL ch in the sequence diagrams), the IP addresses of all backend compute nodes, and the host IDs of all backend compute nodes.

NOTE: The Frontend and Backend each manage their own memory pool, independent of Local Services, specifically for managing the Channels necessary for Overlay Network communication.

In the environment, there will be environment variables set to tell the overlay network what gateway channels it should monitor for remote messaging.

Once the agent is up, it sends a ping saying as much.

Overlay input arguments (M3, M6):
start_overlay_network(ch_in_sdesc: B64, ch_out_sdesc: B64, log_sdesc: B64, host_ids: list[str], ip_addrs: list[str], frontend: bool = False, env: Optional[dict] = None)

Entry point to start the launcher overlay network

Parameters
  • ch_in_sdesc (B64) – Channel to be used for infrastructure messages incoming from launcher frontend/backend. One component of “CTRL ch”

  • ch_out_sdesc (B64) – Channel to be used for infrastructure messages outgoing to launcher frontend/backend. One component of “CTRL ch”

  • log_sdesc (B64) – Channel to be used for sending logging messages to the Frontend

  • host_ids (list[str]) – List of host IDs for all nodes that will be communicated with

  • ip_addrs (list[str]) – List of IP addresses for all nodes that will be communicated with

  • frontend (bool, optional) – Whether the parent process is the launcher frontend, defaults to False

  • env (dict, optional) – environment variables. Should contain gateway channels, defaults to None

Returns

Popen object for the overlay network process

Return type

subprocess.Popen
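
A hypothetical invocation might look like the following sketch. The import paths and the pre-built channel objects are assumptions for illustration; only the signature above is authoritative:

import os

from dragon.launcher.util import start_overlay_network  # import path assumed
from dragon.utils import B64                             # import path assumed

# ch_in, ch_out, and log_ch are assumed to be Channels already allocated
# from the frontend's private overlay pool (see the NOTE above).
overlay = start_overlay_network(
    ch_in_sdesc=B64(ch_in.serialize()),    # incoming half of "CTRL ch"
    ch_out_sdesc=B64(ch_out.serialize()),  # outgoing half of "CTRL ch"
    log_sdesc=B64(log_ch.serialize()),     # logging channel to the Frontend
    host_ids=["18446744071562724608", "18446744071562724864"],
    ip_addrs=["10.128.0.5", "10.128.0.6"],
    frontend=True,
    env=dict(os.environ),  # assumed to already contain gateway channel vars
)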

Launch backend and its overlay network (A4-M8)

Use the workload manager to launch one instance of the backend on each compute node. In the command-line arguments, provide a channel for communicating with the frontend, the frontend's IP address, and its host ID.

The Backend will provide the frontend channel, IP, and host ID to start its own TCP overlay agent as in Launch frontend overlay agent (A3-A4). Using the provided information via the launch command and its own locally obtained IP address and host ID, each backend instance will be able to communicate directly with the single frontend instance and vice-versa.

Once the backend and its overlay agent are up, it will send a BEIsUp message to the frontend. The frontend waits until it receives this message from every backend before continuing.

Provide Node Index to each Backend instance (M9)

The Frontend will use host ID information in the BEIsUp message to assign node indices. The primary node is given node index 0 and can be specified by hostname by the user when starting the Dragon runtime. The remaining values will be integers up to N-1, where N is the number of backend compute nodes.

The node index is given in FENodeIdxBE. It travels from the frontend over the channel whose descriptor was provided in the BEIsUp by the backend.
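
Though the real implementation details differ, the assignment can be pictured with a small sketch; hostnames and host IDs below are taken from the example config files, and the ordering policy is an assumption:

def assign_node_indices(be_nodes, primary_hostname=None):
    """Assign node index 0 to the primary node and 1..N-1 to the rest.

    be_nodes: list of (hostname, host_id) pairs gathered from BEIsUp messages.
    primary_hostname: optional user-specified primary node.
    """
    nodes = sorted(be_nodes)  # deterministic ordering by hostname
    if primary_hostname is not None:
        # Stable sort: the primary node floats to the front (index 0).
        nodes.sort(key=lambda n: n[0] != primary_hostname)
    return {host_id: idx for idx, (host, host_id) in enumerate(nodes)}

indices = assign_node_indices(
    [("nid00004", 18446744071562724608), ("nid00005", 18446744071562724864)],
    primary_hostname="nid00004",
)
# {18446744071562724608: 0, 18446744071562724864: 1}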

Start Local Services (A12-A15)

This is the most critical part of bring-up. If it is successful, everything else will likely be fine.

Launcher Backend popens Local Services. Over stdin, it provides BENodeIdxSH. That message contains the logging channel descriptor, hostname, primary node status (boolean), node index, and the IP address for backend communication. This is potentially a different IP address than the one used for the overlay network; even if the IP is the same, the port is different, as it will be used by a different transport agent for communication among the backend nodes, NOT with the frontend.

SHPingBE is returned to the backend over stdout. This message contains the channel descriptors that will be used for all future communication on the node.

The backend sends BEPingSH to confirm channel comms are successful. SHChannelsUp is returned by Local Services. It contains channel descriptors for all service communication for its node, including Global Services (for the primary node), which every other node will use to communicate with the Global Services instance on the primary node. The Host ID is also contained in SHChannelsUp.

Many-to-One SHChannelsUp, TAUp, TAHalted, SHHaltBE (Bring-up M14, M21 | Teardown M15, M19):

These messages form a gather operation from all the backend nodes to the frontend. This is a potential hang point, and care is required to eliminate the likelihood of a hung bring-up sequence; see the sketch below.
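
One common mitigation is a gather with a timeout, so a silent node surfaces as an error instead of a hang. A minimal sketch, with a queue standing in for the frontend's infrastructure channel and the message layout assumed:

import queue

def gather_from_backends(msg_queue, nnodes, timeout=30.0):
    """Gather one message per backend node, failing fast on silence."""
    received = {}
    while len(received) < nnodes:
        try:
            msg = msg_queue.get(timeout=timeout)  # stand-in for a channel recv
        except queue.Empty:
            missing = nnodes - len(received)
            raise TimeoutError(f"bring-up gather timed out; {missing} node(s) silent")
        received[msg["node_index"]] = msg
    return received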

One-to-Many LAChannelsInfo, SHHaltTA (Bring-up M15 | Teardown M10, M16):

These messages and their communication (a one-to-many Bcast) represent one of the biggest bottlenecks in the bring-up sequence. For larger messages, the first step in reducing that cost is compression. The next would be to implement a tree Bcast.

Transmit LAChannelsInfo (A15-A16)

This is the most critical message. It contains all information necessary to make execution of the Dragon runtime possible.

The Frontend receives one SHChannelsUp from each backend and aggregates them into LAChannelsInfo. This is potentially a very large message, as it scales linearly with the number of nodes, so it is beneficial to compress it before transmitting; the sketch below illustrates the idea.
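
The effect of compression is easy to picture with a runnable sketch; the payload layout below is illustrative, not the actual LAChannelsInfo encoding:

import json
import zlib

# One fabricated SHChannelsUp-like record per node; the message grows
# linearly with node count.
sh_channels_up = {str(i): {"host_id": i, "shep_cd": "x" * 512} for i in range(1024)}
payload = json.dumps(sh_channels_up).encode()

# Compress once on the frontend before the one-to-many broadcast.
compressed = zlib.compress(payload, level=6)
print(f"{len(payload)} bytes -> {len(compressed)} bytes")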

Start Transport Agent (A16-A18)

Local Services starts the Transport Agent as a child process. Over the command line, it passes the node's index, a channel descriptor for receiving messages from Local Services (TA ch), and a logging channel descriptor. The environment contains variables holding the channel descriptors of the gateway channels, which are used for sending and receiving messages off-node.

Over TA ch, LAChannelsInfo is provided so the transport agent can use the IP and host ID info to manage communication with the other nodes on the backend.

The transport agent communicates a TAPingSH to Local Services over the Local Services channel given for its node index in the LAChannelsInfo message. This is sent upstream as a TAUp message. The frontend waits for such a message from every backend.

CLI Inputs to Start Transport Agent (M17)
start_transport_agent(node_index: Union[int, str], in_ch_sdesc: B64, log_ch_sdesc: Optional[B64] = None, args: Optional[list[str]] = None, env: Optional[dict[str, str]] = None, gateway_channels: Optional[Iterable[Channel]] = None, infrastructure: bool = False) -> Popen

Start the backend transport agent

Parameters
  • node_index (Union[int, str]) – node index for this agent. Should be in [0, N-1], where N is the number of backend compute nodes

  • in_ch_sdesc (B64) – Channel to be used for incoming infrastructure messages

  • log_ch_sdesc (B64, optional) – Channel to be used for logging to the Frontend, defaults to None

  • args (list[str], optional) – Command args for starting the Transport Agent, defaults to None, in which case the TCP transport agent is started

  • env (dict[str, str], optional) – Environment variables, defaults to None

  • gateway_channels (Iterable[Channel], optional) – gateway channels to be set in the transport agent’s environment. If not set, it is assumed they are already set in the parent environment or the env input, defaults to None

  • infrastructure (bool, optional) – whether this agent will be used for infrastructure messaging, defaults to False

Returns

Transport agent child process

Return type

subprocess.Popen
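
As with the overlay agent, a hypothetical call might look like the sketch below; the import paths and the channel objects are assumptions, and only the signature above is authoritative:

from dragon.launcher.util import start_transport_agent  # import path assumed
from dragon.utils import B64                             # import path assumed

# ta_ch, log_ch, and gateways are assumed to be created by Local Services.
agent = start_transport_agent(
    node_index=0,                          # this node's index in [0, N-1]
    in_ch_sdesc=B64(ta_ch.serialize()),    # "TA ch" for messages from LS
    log_ch_sdesc=B64(log_ch.serialize()),  # logging back to the Frontend
    args=None,                             # None selects the default TCP agent
    gateway_channels=gateways,             # else assumed already in the env
    infrastructure=False,
)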

Start Global Services (A18-A22)

After sending TAUp, Local Services on the primary node starts Global Services. Over stdin, Local Services provides LAChannelsInfo. With this information, Global Services is immediately able to ping all Local Services instances on the backend via GSPingSH (note: the Transport Agent is necessary to route those correctly).

After receiving a response from every Local Services (SHPingGS), it sends a GSIsUp that is ultimately routed to the Frontend. It is important to note that the GSIsUp message is not guaranteed to arrive after all TAUp messages, i.e., GSIsUp will necessarily come after TAUp on the primary node but may come before the TAUp message from any other backend node; see the sketch below.

Once all TAUp and GSIsUp messages are received, the Dragon runtime is fully up.
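
A sketch of how the frontend's wait loop can tolerate that out-of-order arrival; the message layout and the receive helper are assumptions:

def wait_for_runtime_up(nnodes, recv_msg):
    """Collect TAUp from every node and GSIsUp, in any arrival order."""
    ta_up_nodes = set()
    gs_is_up = False
    while len(ta_up_nodes) < nnodes or not gs_is_up:
        msg = recv_msg()  # hypothetical blocking receive of an infra message
        if msg["type"] == "TAUp":
            ta_up_nodes.add(msg["node_index"])
        elif msg["type"] == "GSIsUp":
            gs_is_up = True
    # At this point the Dragon runtime is fully up.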

Start User Program (A22-A25)

The Frontend starts the user program. This program was provided with the launcher’s invocation. It is sent to the primary node via a GSProcessCreate message. Upon receipt, Global Services sends a GSProcessCreateResponse.

After sending the response, it selects a Local Services instance to deploy the program to. After selection, it sends this request as a SHProcessCreate to the selected Local Services, which sends a SHProcessCreateResponse as a receipt.

Route User Program stdout (A25-A26)

Local Services ingests all of the User Program's stdout and stderr. This is propagated to the Frontend as a SHFwdOutput message that contains information about which process the strings originated from. With this information, the Frontend can provide the output via its stdout so the user sees it buffered cleanly and correctly with varying levels of verbosity.

Multi Node Teardown

In the multi-node case, Local Services processes a few more messages than in the single-node case. The GSHalted message is forwarded by Local Services to the launcher backend. The SHHaltTA is sent to Local Services, which forwards the TAHalted message to the backend when received. Other than these additional messages, teardown is identical to the single-node version, except that the process is repeated on every one of the compute nodes.

In an abnormal situation, the AbnormalTermination message may be received by the Launcher from either Local Services or Global Services (via the Backend). In that case, the launcher initiates teardown of the infrastructure, starting with the sending of GSTeardown (message M5 in the diagram below).

Sequence diagram

[Sequence diagram: multi-node teardown message flow (messages M1-M26, actions A1-A29) among the Launcher Front End, TCP Overlay Front End, Launcher Back End, TCP Overlay Back End, Local Services, Transport Agent, Global Services, and the User Program. The individual messages and actions are described in the notes below.]

Multi-Node Teardown Sequence

Notes on Teardown Sequence

Head Proc Exit (A1-A4)

Local Services monitors its managed processes via waitpid. Once it registers an exit, it matches its pid to the internally tracked and globally unique puid. This puid is transmitted via SHProcessExit to Global Services.

Global Services cleans up any resources tied to tracking that specific head process. Once that is complete, it alerts the Frontend via a GSHeadExit.

Local Services Death Watcher (M1):

Local Services' main process runs a thread that repeatedly calls waitpid. This thread receives the exit code from the head process; see the sketch below.
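
A minimal sketch of such a death watcher; the pid-to-puid mapping and notification callback are assumptions for illustration (os.waitstatus_to_exitcode requires Python 3.9+):

import os
import threading

def death_watcher(pid_to_puid, notify_exit):
    """Reap children and translate each pid to its globally unique puid."""
    while True:
        try:
            pid, status = os.waitpid(-1, 0)  # block until any child exits
        except ChildProcessError:
            break  # no children remain to watch
        puid = pid_to_puid.pop(pid, None)
        if puid is not None:
            notify_exit(puid, os.waitstatus_to_exitcode(status))

watcher = threading.Thread(target=death_watcher, args=({}, print), daemon=True)
watcher.start()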

Begin Teardown (A5-A7)

As of this writing (03/24/2023), the Launcher begins teardown when the head process exits.

Teardown is initiated via the GSTeardown message to Global Services. Once this message is sent, all user input is ignored; teardown is what is being done. GSTeardown is also the message sent in response to a SIGINT from the user (assuming the full runtime is up).

Consequently, with a full runtime, GSTeardown is always the first point of entry for exiting the Dragon runtime, no matter the state of the various individual runtime services.

Halt Global Services (A7-A9)

Once Global Services receives the GSTeardown, it detaches from any channels. It does not destroy any (aside from any it may have directly created) because Local Services has created the channels used for infrastructure and manages the memory created for them.

Once it has shut down its resources, it transmits GSHalted to Local Services over stdout and exits. This is forwarded all the way to the Frontend.

Halt Transport Agent (A9-A12)

After the Frontend receives GSHalted, it initiates teardown of all Backend Transport Agents via SHHaltTA, which is eventually routed to the transport agent by Local Services.

Upon receipt, the transport agent cancels any outstanding work, sends TAHalted over its channel to Local Services (NOTE: this should be over stdout), and exits.

Local Services forwards the exit to the Frontend via routing of a TAHalted message. The Frontend waits for receipt of N such messages before continuing (NOTE: this can easily result in hung processes and should be addressed).

Issue SHTeardown (A13-A16)

This is the first step in the more carefully orchestrated destruction of the Backend. Once Local Services receives SHTeardown, it detaches from its logging channel and transmits a SHHaltBE to the Backend.

After destroying its logging infrastructure, the Backend forwards the SHHaltBE to the Frontend. The Frontend waits until it receives N SHHaltBE messages before exiting (NOTE: another potential hang point).

Issue BEHalted (A17-A19)

Frontend sends BEHalted to all Backends. Once sent, there is no more direct communication between Backend and Frontend. The Frontend will simply confirm completion via exit codes from the workload manager.

Once Local Services receives its BEHalted, it deallocates all its allocated memory and exits.

Shutdown Backend and its overlay (A20-A25)

After transmitting BEHalted to Local Services, the Backend issues SHHaltTA to its overlay transport agent. This triggers a teardown identical to the transport agent teardown referenced in Halt Transport Agent (A9-A12).

Once the overlay is down and Local Services is down, as measured by a simple wait on child process exits, the Backend deallocates its managed memory, including logging infrastructure, and exits.

Shutdown Frontend (A26-A30)

After the workload manager returns exit code for the Backend exit, the Frontend shuts down its overlay similar to the Backend in Shutdown Backend and its overlay (A20-A25). After waiting on the successful overlay exit, it similarly deallocates managed memory and exits.

After the Frontend exits, the Dragon runtime is down.