Running Across Nodes
To run in multinode mode, Dragon must know what resources are available for its use on the compute backend. Dragon natively supports the Slurm and PBS Workload Managers (WLMs) and can, in most cases, automatically detect the allocated resources when running within an active job on a cluster or supercomputer.
In some cases, however, no traditional WLM is present, Dragon cannot automatically detect the available resources, or only a subset of the available resources should be used. In these cases, the dragon launcher supports two lightweight workload managers: DragonRun and SSH.
The following multinode configurations are supported:
Running Dragon with a Traditional Workload Manager
To launch a Dragon program on several compute nodes, a workload manager job allocation
obtained via salloc or sbatch (Slurm) or qsub (PBS+Pals) is required, e.g.:
$ salloc --nodes=2
$ dragon myprog.py arg1 arg2 ...
If Dragon is run outside of an active WLM allocation, an exception is raised and the program does not execute:
$ dragon p2p_lat.py --iterations 100 --lg_max_message_size 12 --dragon
RuntimeError: Executing in a Slurm environment, but with no job allocation.
Resubmit as part of an 'salloc' or 'sbatch' execution
To override this default behavior and execute a Dragon program on the same node as your shell,
the --single-node-override / -s option is available.
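As a rough pre-flight check before invoking the launcher, a script can look for the job-ID variables that Slurm (SLURM_JOB_ID) and PBS (PBS_JOBID) export inside an allocation. The sketch below is only an assumption about a convenient check; it does not describe how Dragon itself detects allocations, and the function name detect_wlm_allocation is hypothetical.

```python
import os


def detect_wlm_allocation(env=None):
    """Return 'slurm', 'pbs', or None, based on standard job-ID variables.

    Hypothetical helper: this is NOT Dragon's own detection logic.
    """
    env = os.environ if env is None else env
    if env.get("SLURM_JOB_ID"):
        return "slurm"
    if env.get("PBS_JOBID"):
        return "pbs"
    return None


if __name__ == "__main__":
    wlm = detect_wlm_allocation()
    if wlm is None:
        print("No job allocation detected; consider --single-node-override")
    else:
        print(f"Running inside a {wlm} allocation")
```

Passing a plain dictionary instead of the real environment keeps the check easy to exercise outside a cluster.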
The Dragon runtime assumes all nodes in an allocation are to be used unless the --nodes option is specified.
This limits the user program to executing on a smaller subset of nodes, potentially useful for execution of scaling
benchmarks. For example, if the user has a job allocation for 4 nodes, but only wants to use 2 for their Dragon program,
they may do the following:
$ salloc --nodes=4
$ dragon --nodes 2 p2p_lat.py --iterations 100 --lg_max_message_size 12 --dragon
Running Dragon using either the DragonRun or SSH WLM
The DragonRun and SSH WLMs are Lightweight Workload Managers (WLM) built into Dragon. These can be used on a generic cluster without a traditional WLM, or any time Dragon needs to be run on a set of backend resources that otherwise can’t be automatically detected. This includes cases where a traditional WLM is present but perhaps only a specific subset of the allocated nodes should be used, or when Dragon is not able to accurately detect the allocated nodes from the traditional WLM.
The DragonRun WLM uses an SSH-based ‘command and control tree’ to efficiently fan out the launch of the Dragon runtime on the backend compute nodes. By using this tree-based launch mechanism, the DragonRun WLM can launch on large numbers of nodes with minimal load on the frontend Dragon launcher.
The soon-to-be-deprecated SSH WLM uses a more traditional one-to-many SSH launch mechanism. Here the frontend Dragon launcher connects over SSH to each backend compute node individually in order to launch the Dragon runtime. This one-to-many mechanism can place significant load on the frontend Dragon launcher when launching on large numbers of nodes, and is therefore not recommended for large-scale runs. For small-scale runs, the SSH WLM can be used as a simple alternative to the DragonRun WLM if desired.
The following sections describe how to use the DragonRun and SSH WLM options:
Using the DragonRun WLM:
To use the DragonRun WLM, the following options must be provided on the dragon launcher command line:
Select the DragonRun (drun) SSH Workload Manager
The --wlm drun / -w drun option tells the dragon launcher to use the DragonRun launch WLM.
Provide available backend compute resources
The list of available backend compute resources can be provided to the dragon launcher in one of several ways, described in the sections below.
Note: Dragon requires that all nodes are configured for password-less SSH and maintain mutual routability.
Using the SSH WLM:
To use the SSH WLM, the following options must be provided on the dragon launcher command line:
Select the SSH Workload Manager
The --wlm ssh / -w ssh option tells the dragon launcher to use the SSH launch WLM.
Provide available backend compute resources
The list of available backend compute resources can be provided to the dragon launcher in one of several ways, described in the sections below.
Note: Dragon requires that all nodes are configured for password-less SSH and maintain mutual routability.
Using the Dragon Hosts (dhosts) utility
The dhosts utility defines the list of hosts that should be used by other Dragon
runtime tools. To do this, dhosts generates a temporary hostfile and exports the
DRAGON_RUN_NODEFILE environment variable within a subshell. To generate the host
list, dhosts first attempts to detect an active Workload Manager (WLM)
allocation, such as from Slurm or PBS. If no WLM is present, or if dhosts is unable
to detect the allocated WLM nodes, the list of hosts can be specified manually via the
--hostlist or --hostfile options.
This is useful for running other dragon tools on a specific set of hosts without having to specify the list of hosts to each tool individually. Since dhosts exports the DRAGON_RUN_NODEFILE environment variable, any tool that relies on this environment variable can automatically use the generated hostlist. For example, dragon-cleanup will automatically use the hostlist generated by dhosts if DRAGON_RUN_NODEFILE is set in the environment.
To provide the available nodes explicitly on the dhosts command line, specify the available
backend hostnames as a comma-separated list, e.g.: --hostlist host_1,host_2,host_3.
$ dhosts --hostlist host_1,host_2,host_3
$ echo $DRAGON_RUN_NODEFILE
/tmp/dragon_run_nodefile_12345
$ cat $DRAGON_RUN_NODEFILE
host_1
host_2
host_3
To provide the available nodes via a text file, create a newline-separated text file with each
backend node’s hostname on a separate line. Pass the name of the text file to the dhosts
command line, e.g.: --hostfile hosts.txt.
$ cat hosts.txt
host_1
host_2
host_3
$ dhosts --hostfile hosts.txt
$ echo $DRAGON_RUN_NODEFILE
/tmp/dragon_run_nodefile_12345
$ cat $DRAGON_RUN_NODEFILE
host_1
host_2
host_3
NOTE: You cannot use both --hostfile and --hostlist on the command line at the same time.
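A tool that wants to honor the host list exported by dhosts only needs to read the file named by DRAGON_RUN_NODEFILE. The following is a minimal sketch, assuming the file holds one hostname per line as in the transcripts above; the helper name read_nodefile is hypothetical and is not part of Dragon.

```python
import os
from pathlib import Path


def read_nodefile(env=None):
    """Return the hostnames listed in $DRAGON_RUN_NODEFILE, or [] if unset.

    Hypothetical helper; assumes one hostname per line.
    """
    env = os.environ if env is None else env
    path = env.get("DRAGON_RUN_NODEFILE")
    if not path:
        return []
    lines = Path(path).read_text().splitlines()
    # Skip blank lines and strip stray whitespace around each hostname.
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    hosts = read_nodefile()
    print(f"{len(hosts)} host(s):", ", ".join(hosts) or "(none)")
```

Returning an empty list when the variable is unset lets the caller decide whether to fall back to another source of hostnames.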
Providing a Host List or Host File
A list of hosts can be provided to the dragon launcher either by listing them explicitly on the dragon command line or by passing the name of a newline-separated text file containing the host names.
To provide the available nodes explicitly on the dragon command line, specify the available
backend hostnames as a comma-separated list, e.g.: --hostlist host_1,host_2,host_3.
$ dragon -w drun -t tcp --hostlist host_1,host_2,host_3 [PROG]
To provide the available nodes via a text file, create a newline-separated text file with each
backend node’s hostname on a separate line. Pass the name of the text file to the dragon
launcher, e.g.: --hostfile hosts.txt.
$ cat hosts.txt
host_1
host_2
host_3
$ dragon -w drun -t tcp --hostfile hosts.txt [PROG]
NOTE: You cannot use both --hostfile and --hostlist on the command line at the same time.
When passing the list of available backend nodes in either of these ways, the dragon launcher needs to determine basic network configuration settings for each listed node before it can launch the Dragon user application. This is done by launching a utility application on each listed node to report the node’s IP and other relevant information. Running this utility application slightly delays the startup of Dragon. To prevent this delay, you can instead generate a Dragon network-config file as explained below.
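The host-list and host-file invocations above can also be assembled from a script. The sketch below only builds the argument list using the flags shown in this section (-w, -t, --hostlist, --hostfile); the helper names write_hostfile and dragon_command are hypothetical, and the check for mutually exclusive options mirrors the note above rather than any Dragon API.

```python
from pathlib import Path


def write_hostfile(hosts, path="hosts.txt"):
    """Write one hostname per line, dropping blanks and duplicates."""
    unique = list(dict.fromkeys(h.strip() for h in hosts if h.strip()))
    Path(path).write_text("\n".join(unique) + "\n")
    return unique


def dragon_command(program, hostfile=None, hostlist=None, wlm="drun", transport="tcp"):
    """Build a dragon launcher argv list (hypothetical helper)."""
    # --hostfile and --hostlist cannot be combined, so require exactly one.
    if (hostfile is None) == (hostlist is None):
        raise ValueError("pass exactly one of hostfile or hostlist")
    cmd = ["dragon", "-w", wlm, "-t", transport]
    if hostfile is not None:
        cmd += ["--hostfile", str(hostfile)]
    else:
        cmd += ["--hostlist", ",".join(hostlist)]
    return cmd + [program]


if __name__ == "__main__":
    print(" ".join(dragon_command("myprog.py", hostlist=["host_1", "host_2", "host_3"])))
```

The resulting list can be handed to subprocess.run on a machine where the dragon launcher is installed.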
Providing a Dragon Network-Config File
Dragon provides a utility application to gather and persist relevant network information from its backend compute resources. This utility can be used to generate a persistent YAML or JSON configuration which, when passed to the dragon launcher, provides all required information about a set of backend compute nodes.
To generate a network configuration file for a given set of backend compute nodes, run the
dragon-network-config tool as shown below:
$ dragon-network-config -w drun --hostlist host1,host2,host3,host4 -j
$ ls ssh.json
ssh.json
Once you have a network configuration file, the name of the configuration file can be passed to the dragon launcher to identify the available backend compute resources:
$ dragon -w drun -t tcp --network-config ssh.json [PROG]
NOTE: Changes to the backend compute nodes’ IP addresses or other relevant network
settings will invalidate the saved network config file. If this happens, re-run the
dragon-network-config tool to collect updated information.
The dragon-network-config help is below:
usage: dragon-network-config [-h] [-p PORT] [--network-prefix NETWORK_PREFIX] [--wlm WORKLOAD_MANAGER] [--log] [--output-to-yaml] [--output-to-json]
                             [--no-stdout] [--primary PRIMARY] [--hostlist HOSTLIST | --hostfile HOSTFILE]

Runs Dragon internal tool for generating network topology

optional arguments:
  -h, --help            show this help message and exit
  -p PORT, --port PORT  Infrastructure listening port (default: 6565)
  --network-prefix NETWORK_PREFIX
                        NETWORK_PREFIX specifies the network prefix the dragon runtime will use to determine which IP addresses it should use to build
                        multinode connections from. By default the regular expression r'^(hsn|ipogif|ib|eth)\w+$' is used -- the prefix for known HPE-Cray XC
                        and EX high speed networks. If uncertain which networks are available, the following will return them in pretty formatting:
                        `dragon-network-ifaddrs --ip --no-loopback --up --running | jq`. Prepending with `srun` may be necessary to get networks available on
                        backend compute nodes
  --wlm WORKLOAD_MANAGER, -w WORKLOAD_MANAGER
                        Specify what workload manager is used. Currently supported WLMs are: slurm, pbs+pals, ssh
  --log, -l             Enable debug logging
  --output-to-yaml, -y  Output configuration to YAML file
  --output-to-json, -j  Output configuration to JSON file
  --no-stdout           Do not print the configuration to stdout
  --primary PRIMARY     Specify the hostname to be used for the primary compute node
  --hostlist HOSTLIST   Specify backend hostnames as a comma-separated list, eg: `--hostlist host_1,host_2,host_3`. `--hostfile` or `--hostlist` is a
                        required argument for WLM SSH and is only used for SSH
  --hostfile HOSTFILE   Specify a list of hostnames to connect to via SSH launch. The file should be a newline character separated list of hostnames.
                        `--hostfile` or `--hostlist` is a required argument for WLM SSH and is only used for SSH
# To create YAML and JSON files with a slurm WLM:
$ dragon-network-config --wlm slurm --output-to-yaml --output-to-json
The format of the network-config file is shown below, first in YAML and then in JSON:
'0':
  h_uid: null
  host_id: 18446744071562724608
  ip_addrs:
  - 10.128.0.5:6565
  is_primary: true
  name: nid00004
  num_cpus: 0
  physical_mem: 0
  shep_cd: ''
  state: 4
'1':
  h_uid: null
  host_id: 18446744071562724864
  ip_addrs:
  - 10.128.0.6:6565
  is_primary: false
  name: nid00005
  num_cpus: 0
  physical_mem: 0
  shep_cd: ''
  state: 4
{
    "0": {
        "state": 4,
        "h_uid": null,
        "name": "nid00004",
        "is_primary": true,
        "ip_addrs": [
            "10.128.0.5:6565"
        ],
        "host_id": 18446744071562724608,
        "num_cpus": 0,
        "physical_mem": 0,
        "shep_cd": ""
    },
    "1": {
        "state": 4,
        "h_uid": null,
        "name": "nid00005",
        "is_primary": false,
        "ip_addrs": [
            "10.128.0.6:6565"
        ],
        "host_id": 18446744071562724864,
        "num_cpus": 0,
        "physical_mem": 0,
        "shep_cd": ""
    }
}
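Before handing a saved configuration to the launcher, it can be helpful to confirm which node is marked primary and which addresses are recorded. The sketch below reads the JSON form using only the field names visible in the example above (name, is_primary, ip_addrs); the sample data is trimmed to those fields, and the function name summarize_config is hypothetical.

```python
import json

# Two-node configuration based on the example above, trimmed to three fields.
SAMPLE_CONFIG = """
{
    "0": {"name": "nid00004", "is_primary": true,  "ip_addrs": ["10.128.0.5:6565"]},
    "1": {"name": "nid00005", "is_primary": false, "ip_addrs": ["10.128.0.6:6565"]}
}
"""


def summarize_config(text):
    """Return (names of primary nodes, {node name: ip_addrs}) from JSON text."""
    config = json.loads(text)
    # Top-level keys are node indices stored as strings; order them numerically.
    nodes = [config[key] for key in sorted(config, key=int)]
    primaries = [node["name"] for node in nodes if node["is_primary"]]
    addresses = {node["name"]: node["ip_addrs"] for node in nodes}
    return primaries, addresses


primaries, addresses = summarize_config(SAMPLE_CONFIG)
print("primary node(s):", primaries)
for name, addrs in addresses.items():
    print(name, "->", ", ".join(addrs))
```

For a real file, the text would come from reading the generated configuration instead of the embedded sample.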
When nodes have multiple available NICs, attention should be paid to the number and order of
IP addresses specified in the network configuration file. Because the dragon-network-config
utility has no way of knowing which of the multiple NICs and IP addresses should be used
preferentially on a given node, the list of “ip_addrs” specified in the network config
YAML/JSON file may need to be manually adjusted to ensure the preferred IP address is first
in the list. This manual review and reordering is only necessary when some NICs can
route to other nodes in the Dragon cluster and others cannot.
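The manual reordering described above can be scripted when the routable addresses all belong to a known subnet. The following is a minimal sketch, assuming IPv4 entries of the form address:port as in the example configuration; the subnet 10.128.0.0/16 and the helper names prefer_subnet and reorder_config are illustrative assumptions, not part of Dragon.

```python
import ipaddress


def prefer_subnet(ip_addrs, subnet):
    """Move entries whose address lies in `subnet` to the front of the list."""
    network = ipaddress.ip_network(subnet)

    def in_subnet(entry):
        host = entry.rsplit(":", 1)[0]  # drop the trailing ":port"
        return ipaddress.ip_address(host) in network

    # sorted() is stable, so relative order within each group is preserved.
    return sorted(ip_addrs, key=lambda entry: not in_subnet(entry))


def reorder_config(config, subnet):
    """Apply prefer_subnet to every node of a loaded network config dict."""
    for node in config.values():
        node["ip_addrs"] = prefer_subnet(node["ip_addrs"], subnet)
    return config


if __name__ == "__main__":
    demo = {"0": {"ip_addrs": ["192.168.1.5:6565", "10.128.0.5:6565"]}}
    print(reorder_config(demo, "10.128.0.0/16"))
```

Because the sort is stable, a node whose addresses all fall outside the subnet is left unchanged.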
Although not specified as part of the network configuration, if the frontend node also has
multiple NICs and only some have available routes to the compute nodes, it is possible to
specify the routable IP address (and thereby NIC) to use on the frontend node for all
communications with the compute nodes via the environment variable, DRAGON_FE_IP_ADDR.
A toy example showing how to specify which NIC to use on the frontend / head node
while simultaneously specifying which NICs to use on the compute nodes (via the network
config JSON file):
# Note that the value "1.2.3.4" should be replaced with the appropriate local IP address.
$ DRAGON_FE_IP_ADDR="1.2.3.4:6566" dragon --wlm drun --network-config my_cluster_config.json --network-prefix '' my_user_code.py
High Speed Transport Agent (HSTA)
HSTA is a high-speed transport agent that provides MPI-like performance using
Dragon Channels. HSTA uses libfabric or libucp for communication over Slingshot
or InfiniBand high-speed interconnection networks. If you have one of these networks
you can configure HSTA to run on it using the appropriate dragon-config options. See
the Installation section for examples of how to configure Dragon to use HSTA.
The HSTA transport agent is currently not available in the open-source version of Dragon. For inquiries about Dragon’s high-speed RDMA-based transport, please contact HPE by emailing dragonhpc@hpe.com.
TCP-based Transport Agent
The TCP-based transport agent is the default transport agent for the Dragon open-source package. The TCP transport agent uses standard TCP for inter-node communication through Dragon Channels.
When using a version of Dragon that includes the HSTA transport agent and you prefer to
use the TCP transport agent, the --transport tcp option can be passed to the launcher (see:
FAQ and Launcher options). The dragon-config
command can also be used to specify that the TCP transport should be used. To do that you run
dragon-config as follows.
The TCP agent is configured to use port 7575 by default. If that port is blocked,
it can be changed with the --port argument to dragon. If not specified,
7575 is used, e.g.:
# Port 7575 used
$ dragon --nodes 2 p2p_lat.py --iterations 100 --lg_max_message_size 12 --dragon
# Port 7000 used
$ dragon --port 7000 --nodes 2 p2p_lat.py --iterations 100 --lg_max_message_size 12 --dragon
The TCP transport agent also favors known Cray high-speed interconnect networks by default. This is accomplished via
a regex specification of the network’s named prefix matching ipogif (Aries) or hsn (Slingshot): r'^(hsn|ipogif)\d+$'.
To change, for example, to match only hsn networks, the --network-prefix argument could be used:
$ dragon --network-prefix hsn --nodes 2 p2p_lat.py --iterations 100 --lg_max_message_size 12 --dragon
Known Issue: If the --network-prefix argument matches no existing network interface, the Dragon runtime will hang.
This will be fixed in a future release. For now, suspend the launcher with Ctrl+Z and then kill it to recover.
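Since a prefix that matches nothing leaves the runtime hung, it can be worth testing a pattern against the interface names on a node before launching. The sketch below assumes the default pattern is ^(hsn|ipogif)\d+$ (with \d meaning a digit) and that patterns are applied to interface names with re.match semantics; how Dragon interprets the --network-prefix value internally is not confirmed here, and the interface names and the helper matching_interfaces are illustrative.

```python
import re

# Assumed default pattern: hsn (Slingshot) or ipogif (Aries) followed by digits.
DEFAULT_PREFIX = r"^(hsn|ipogif)\d+$"


def matching_interfaces(names, pattern=DEFAULT_PREFIX):
    """Return the interface names that the pattern matches."""
    rx = re.compile(pattern)
    return [name for name in names if rx.match(name)]


# Illustrative interface names; real ones vary by system.
names = ["lo", "eth0", "hsn0", "hsn1", "ipogif0"]

default_hits = matching_interfaces(names)
hsn_only = matching_interfaces(names, r"^hsn\d+$")
no_hits = matching_interfaces(names, r"^ib\d+$")

print("default pattern:", default_hits)
print("hsn only:", hsn_only)
if not no_hits:
    print("pattern ^ib\\d+$ matches no interface; launching with it would be unsafe")
```

On a real node, the list of names could be taken from socket.if_nameindex() in the Python standard library or from the dragon-network-ifaddrs tool mentioned in the help text.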