Command Line Interface
Dragon provides several command line interfaces (CLIs) that allow users to interact with the Dragon runtime and its components.
- dragon - The dragon command is used to start the Dragon runtime services and user applications.
- dragon-config - The dragon-config command is used to set configuration options for the Dragon runtime.
- dragon-cleanup - The dragon-cleanup command is used to clean up Dragon runtime services and user applications in either a single or multi-node environment.
- drun - The drun command uses an ssh-tree to run user applications on a set of hostnames.
- dhosts - The dhosts command opens an interactive shell configured to run applications on a specified list of hostnames.
dragon
Dragon Launcher Arguments and Options
usage: dragon [-h] [--hostlist HOSTLIST | --hostfile HOSTFILE]
[--network-prefix NETWORK_PREFIX]
[--network-config NETWORK_CONFIG] [-w WORKLOAD_MANAGER]
[-p PORT] [--overlay-port OVERLAY_PORT]
[--frontend-port FRONTEND_PORT] [-t TRANSPORT_AGENT]
[-o OVERLAY_TRANSPORT_AGENT] [-s | -m] [-l LOG_LEVEL] [-r]
[-N NODE_COUNT] [-i IDLE_COUNT] [-e] [-T TELEM_LEVEL] [-b]
[--no-label] [--basic-label] [--verbose-label] [--version]
[PROG] ...
Positional Arguments
- PROG
PROG specifies the program to run on the primary compute node. The file does not need to be executable: if it is not, Python 3 is used to interpret it. All command-line arguments after PROG are passed to the program as its arguments. Both PROG and its arguments are optional.
- ARG
Zero or more program arguments may be specified.
Default:
[]
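The rule for PROG can be pictured with a small sketch. The helper name resolve_prog_argv, the interpreter name python3, and the use of file mode bits to decide what counts as "executable" are assumptions made for illustration; this is not the launcher's actual code.

```python
import os
import stat
import tempfile


def resolve_prog_argv(prog, args):
    """Return the argv a launcher might run for PROG (hypothetical helper).

    If any executable bit is set on PROG it is run directly; otherwise it
    is handed to a Python 3 interpreter, as described for PROG above.
    """
    mode = os.stat(prog).st_mode
    if mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        return [prog, *args]
    return ["python3", prog, *args]


# Demonstrate with a temporary script, first without and then with the
# executable bit set.
with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "app.py")
    with open(script_path, "w") as fh:
        fh.write("print('hello')\n")
    os.chmod(script_path, 0o644)
    not_exec_argv = resolve_prog_argv(script_path, ["--size", "4"])
    os.chmod(script_path, 0o755)
    exec_argv = resolve_prog_argv(script_path, ["--size", "4"])
```

In both cases the trailing arguments are forwarded unchanged, which matches the statement that everything after PROG belongs to the program.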
Named Arguments
- --hostlist
Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
- --hostfile
Specify a file listing the hostnames to connect to via SSH launch, one hostname per line. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
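As a minimal sketch of the two host options, the Python below writes a newline-separated hostfile and assembles a dragon command line for an SSH launch. The helper names, the host names, and the program name my_app.py are placeholders, and the sketch only builds the argument list without invoking dragon.

```python
import os
import tempfile


def write_hostfile(path, hosts):
    """Write one hostname per line, the format --hostfile expects."""
    with open(path, "w") as fh:
        fh.write("\n".join(hosts) + "\n")


def ssh_launch_argv(prog, hostfile=None, hostlist=None):
    """Build a dragon command line for the ssh workload manager.

    Exactly one of hostfile/hostlist must be given, mirroring the
    mutually exclusive [--hostlist HOSTLIST | --hostfile HOSTFILE] usage.
    """
    if (hostfile is None) == (hostlist is None):
        raise ValueError("give exactly one of hostfile or hostlist")
    argv = ["dragon", "--wlm", "ssh"]
    if hostfile is not None:
        argv += ["--hostfile", hostfile]
    else:
        argv += ["--hostlist", ",".join(hostlist)]
    return argv + [prog]


hosts = ["host_1", "host_2", "host_3"]
with tempfile.TemporaryDirectory() as tmp:
    hostfile_path = os.path.join(tmp, "my_hostfile.txt")
    write_hostfile(hostfile_path, hosts)
    with open(hostfile_path) as fh:
        hostfile_text = fh.read()
    file_argv = ssh_launch_argv("my_app.py", hostfile=hostfile_path)

list_argv = ssh_launch_argv("my_app.py", hostlist=hosts)
```

The resulting lists could be passed to subprocess.run on a system where dragon is installed.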
- --network-prefix
NETWORK_PREFIX specifies the network prefix the Dragon runtime uses to determine which IP addresses to build multinode connections from. By default the regular expression r'^(hsn|ipogif|ib)\d+$' is used, which matches the prefixes of known HPE-Cray XC and EX high speed networks. If you are unsure which networks are available, the following command prints them in a readable format: dragon-network-ifaddrs --ip --no-loopback --up --running | jq. Prepending srun may be necessary to list the networks available on backend compute nodes.
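To see which interface names a prefix would select, the sketch below applies a pattern to a list of names. It assumes the default pattern is ^(hsn|ipogif|ib)\d+$ and that matching is anchored at the start of the name; the interface names and the helper matching_interfaces are illustrative, not part of Dragon.

```python
import re

# Default pattern as quoted in the --network-prefix description
# (assumed to contain \d+ for the trailing interface number).
DEFAULT_NETWORK_PREFIX = r"^(hsn|ipogif|ib)\d+$"


def matching_interfaces(names, prefix=DEFAULT_NETWORK_PREFIX):
    """Return the interface names that match the given prefix regex."""
    pattern = re.compile(prefix)
    return [name for name in names if pattern.match(name)]


candidates = ["lo", "eth0", "hsn0", "hsn1", "ib0", "ipogif0", "ibs"]
selected = matching_interfaces(candidates)
custom = matching_interfaces(candidates, r"^eth\d+$")
```

A custom expression such as ^eth\d+$ shows how --network-prefix could be pointed at an Ethernet interface on a cluster without a high speed network.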
- --network-config
NETWORK_CONFIG specifies a YAML or JSON file describing the available backend compute nodes. The file can be generated by the launcher's network config tool (e.g. dragon-network-config --output-to-yaml), which describes the nodes provided by the workload manager, or it can be written manually, as is needed for an ssh-only launch. An example with keywords and formatting can be found in the documentation.
- -w, --wlm
Possible choices: slurm, pbs+pals, ssh, k8s, drun
Specify what workload manager is used. Currently supported WLMs are: slurm, pbs+pals, ssh, k8s, drun
- -p, --port
PORT specifies the port to be used for multinode communication. By default, 7575 is used.
- --overlay-port
OVERLAY_PORT specifies the port to be used for the dragon overlay network communication. By default, 6565 is used.
- --frontend-port
FRONTEND_PORT specifies the port to be used by the Overlay transport agent running on the Dragon frontend node. By default, 6566 is used.
- -t, --transport
Possible choices: hsta, tcp, configured
TRANSPORT_AGENT selects which transport agent will be used for backend node-to-node communication. In the absence of a dragon-hsta binary, the TCP agent will be used. Currently supported agents are: hsta, tcp
Default:
configured
- -o, --overlay-transport
Possible choices: hsta, tcp, configured
OVERLAY_TRANSPORT_AGENT selects which transport agent will be used for node-to-node communication on the overlay network, connecting the frontend to the backend nodes. In the absence of a dragon-hsta binary, the TCP agent will be used. Currently supported agents are: hsta, tcp
Default:
configured
- -s, --single-node-override
Override automatic launcher selection to force use of the single node launcher
Default:
False
- -m, --multi-node-override
Override automatic launcher selection to force use of the multi-node launcher
Default:
False
- -l, --log-level
Possible choices: NONE, DEBUG, INFO, WARNING, ERROR, CRITICAL, stderr=NONE, stderr=DEBUG, stderr=INFO, stderr=WARNING, stderr=ERROR, stderr=CRITICAL, dragon_file=NONE, dragon_file=DEBUG, dragon_file=INFO, dragon_file=WARNING, dragon_file=ERROR, dragon_file=CRITICAL, actor_file=NONE, actor_file=DEBUG, actor_file=INFO, actor_file=WARNING, actor_file=ERROR, actor_file=CRITICAL
The Dragon runtime can write diagnostic log messages to several output devices. Diagnostic log messages can be seen on the Dragon stderr console, in a combined 'dragon_*.log' file, or in individual log files created by each of the Dragon 'actors' (Global Services, Local Services, etc.).
By default, the Dragon runtime disables all diagnostic log messaging.
Passing one of NONE, DEBUG, INFO, WARNING, ERROR, or CRITICAL to this option enables the specified log verbosity. When DEBUG level logging is enabled, the Dragon runtime limits the stderr and combined dragon log file to INFO level messages, while each actor's log file contains the complete log history, including DEBUG messages. This limits the number of messages sent between the Dragon frontend and the Dragon backend at scale.
To override the default logging behavior and enable specific logging to one or more Dragon output devices, the LOG_LEVEL option can be formatted as a keyword=value pair, where the keyword is one of the Dragon log output devices (stderr, dragon_file or actor_file) and the value is one of NONE, DEBUG, INFO, WARNING, ERROR or CRITICAL (e.g. -l dragon_file=INFO -l actor_file=DEBUG). Multiple -l|--log-level options may be passed to enable the logging desired.
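The rules above can be sketched as a small resolver that folds a list of -l values into one level per output device. This is one reading of the text, not Dragon's implementation: the function name resolve_log_levels is hypothetical, and the assumption that a bare level applies to all three devices is inferred from the DEBUG paragraph. The dictionary keys are the ones shown in the option's default value.

```python
LEVELS = ("NONE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")
DEVICES = {
    "stderr": "DRAGON_LOG_DEVICE_STDERR",
    "dragon_file": "DRAGON_LOG_DEVICE_DRAGON_FILE",
    "actor_file": "DRAGON_LOG_DEVICE_ACTOR_FILE",
}


def resolve_log_levels(options):
    """Fold a list of -l values into a per-device level mapping."""
    result = {key: "NONE" for key in DEVICES.values()}
    for opt in options:
        device, sep, level = opt.partition("=")
        if not sep:
            # Bare level: applied to every device, except that DEBUG is
            # capped at INFO for stderr and the combined dragon file.
            level = device
            if level not in LEVELS:
                raise ValueError(f"unknown log level: {level!r}")
            for name, key in DEVICES.items():
                capped = level == "DEBUG" and name != "actor_file"
                result[key] = "INFO" if capped else level
        else:
            # keyword=value form: set exactly one device.
            if device not in DEVICES or level not in LEVELS:
                raise ValueError(f"bad log option: {opt!r}")
            result[DEVICES[device]] = level
    return result


default_levels = resolve_log_levels([])
debug_levels = resolve_log_levels(["DEBUG"])
mixed_levels = resolve_log_levels(["dragon_file=INFO", "actor_file=DEBUG"])
```

With no options every device stays at NONE, a bare DEBUG yields INFO on stderr and the combined file but DEBUG in the actor files, and the keyword=value form touches only the named devices.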
Default:
{'DRAGON_LOG_DEVICE_STDERR': 'NONE', 'DRAGON_LOG_DEVICE_DRAGON_FILE': 'NONE', 'DRAGON_LOG_DEVICE_ACTOR_FILE': 'NONE'}
- -r, --resilient
If used, the Dragon runtime will attempt to continue execution of the user app in the event of a hardware or user software error by falling back to functional hardware resources and omitting hardware where the given error occurred.
Default:
False
- -N, --nodes
NODE_COUNT specifies the number of nodes to use. NODE_COUNT must be less than or equal to the number of available nodes within the WLM allocation. A value of zero (0) indicates that all available nodes should be used (the default).
- -i, --idle
In conjunction with the --resilient flag, this specifies the number of nodes that will be held in reserve when the user application is run. In the event a node executing the user application experiences an error, the Dragon runtime will pull an "idle" node into the compute pool and begin executing the user application on it.
- -e, --exhaust-resources
When used with --resilient execution, the Dragon runtime will continue executing the user application in the event of any number of localized hardware errors until there are 0 nodes available for computation. If not used, the default behavior is to continue executing until the number of available nodes is less than the number requested via the --nodes argument.
Default:
False
- -T, --telemetry-level
The Dragon runtime supports native and user-defined telemetry. By default, the Dragon runtime disables all telemetry. Passing one of 1, 2, 3, 4, or 5 to this option enables telemetry at the specified verbosity.
Default:
0
- -b, --progress-bar
Enables a progress bar for HSTA request completions vs. the total number of expected request completions for the current launch configuration, which is defined using the values in sys.argv and the number of nodes used for the launch. The first run with this configuration simply collects the necessary information to use a progress bar. Subsequent runs will display the application's progress via the progress bar. Data collected during the first run will be stored in a file contained in a hidden .dragon directory in the current working directory from which the application was launched. This feature currently requires the use of a parallel file system such as Lustre or NFS.
Default:
False
- --no-label
Default:
True
- --basic-label
Default:
False
- --verbose-label
Default:
False
- --version
show program’s version number and exit
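The interaction of --resilient, --nodes, --idle and --exhaust-resources described above can be pictured with a toy model. The function simulate_resilient_run is purely illustrative and assumes a positive --nodes value; it counts nodes and does not reflect how the runtime actually detects or recovers from errors.

```python
def simulate_resilient_run(requested, idle, failures, exhaust=False):
    """Return (still_running, active, idle_left) after `failures` node errors.

    requested: nodes asked for via --nodes (assumed > 0 here)
    idle:      spare nodes held in reserve via --idle
    exhaust:   True models --exhaust-resources
    """
    active, spare = requested, idle
    for _ in range(failures):
        active -= 1              # the failing node is dropped
        if spare > 0:            # pull an idle node into the compute pool
            spare -= 1
            active += 1
        # With --exhaust-resources the run stops only at 0 nodes;
        # otherwise it stops once fewer than `requested` nodes remain.
        floor = 1 if exhaust else requested
        if active < floor:
            return False, active, spare
    return True, active, spare


covered = simulate_resilient_run(requested=4, idle=2, failures=2)
stopped = simulate_resilient_run(requested=4, idle=2, failures=3)
degraded = simulate_resilient_run(requested=4, idle=2, failures=3, exhaust=True)
exhausted = simulate_resilient_run(requested=4, idle=2, failures=6, exhaust=True)
```

In this model two failures are absorbed by the two idle nodes, a third failure ends a default run, and the same third failure is tolerated when exhaust is set.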
dragon-config
Configure the build and runtime environments for Dragon with regard to third-party libraries. This is needed for building network backends for HSTA, as well as for GPU support more generally. In future releases, this script may also be used for runtime configuration of libraries. Additionally, some options provide information about the Dragon installation to allow Dragon header files and libraries to be used in compiled applications.
usage: dragon-config [-h] [-c] [--config-file CONFIG_FILE] [-s] [-g GET]
[-l | -o | -e]
{add,test} ...
Named Arguments
- -c, --clean
Clean out all config information.
Default:
False
- --config-file
Point configuration to a custom config file. Largely intended for testing.
- -s, --serialize
Serialize all key-value pairs currently in the configuration file into a single, colon-separated string that can be passed to the --add command.
Default:
False
- -g, --get
Get value for given key that can be passed to the --add or --add-mpiexec command.
- -l, --linker-options
For execution during linking, print the linker options for building applications against the Dragon C/C++ API
Default:
False
- -o, --compiler-options
For execution during compilation, print the compiler options for building applications against the Dragon C/C++ API
Default:
False
- -e, --explicit-compiler-options
With brief description, print the compilation and link options for building C programs with Dragon and exit
Default:
False
Add and test paths subparser
- add
Possible choices: add, test
Add paths for configuration, compilation, execution, and testing of Dragon
Sub-commands
add
Define a number of paths (key=value) to configure include and library paths for Dragon, or to make the TCP runtime the always-on default for backend communication (set to True).
Examples
UCX backend:
dragon-config add --ucx-include=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/include
dragon-config add --ucx-build-lib=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/lib
dragon-config add --ucx-runtime-lib=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/lib
Set TCP transport as always-on default backend:
dragon-config add --tcp-runtime
Set PMIx header files location to enable PMIx support for MPI applications. Specifically looking for path <pmix include>/src/include/pmix_globals.h:
dragon-config add --pmix-include=/usr/include:/usr/include/pmix
dragon-config add [-h] [--ofi-include OFI_INCLUDE] [--ucx-include UCX_INCLUDE]
[--pmix-include PMIX_INCLUDE] [--mpi-include MPI_INCLUDE]
[--cuda-include CUDA_INCLUDE] [--hip-include HIP_INCLUDE]
[--ze-include ZE_INCLUDE] [--ofi-build-lib OFI_BUILD_LIB]
[--ucx-build-lib UCX_BUILD_LIB]
[--ofi-runtime-lib OFI_RUNTIME_LIB]
[--ucx-runtime-lib UCX_RUNTIME_LIB]
[--cuda-runtime-lib CUDA_RUNTIME_LIB]
[--netconfig-mpiexec-override NETCONFIG_MPIEXEC_OVERRIDE]
[--backend-mpiexec-override BACKEND_MPIEXEC_OVERRIDE]
[--tcp-runtime]
Named Arguments
- --ofi-include
Include path for OFI headers to be used when building dragon
- --ucx-include
Include path for UCX headers to be used when building dragon
- --pmix-include
Include path for PMIx headers to be used when building dragon
- --mpi-include
Include path for MPI headers to be used when building dragon
- --cuda-include
Include path for CUDA headers to be used when building dragon
- --hip-include
Include path for HIP headers to be used when building dragon
- --ze-include
Include path for Ze headers to be used when building dragon
- --ofi-build-lib
Path to OFI libraries (eg: libfabric.so) to be used when building dragon
- --ucx-build-lib
Path to UCX libraries (eg: libucp.so) to be used when building dragon
- --ofi-runtime-lib
Path to OFI libraries (eg: libfabric.so) to be used during app execution
- --ucx-runtime-lib
Path to UCX libraries (eg: libucp.so) to be used during app execution
- --cuda-runtime-lib
Path to CUDA libraries (eg: libcudart.so) to be used during app execution
- --netconfig-mpiexec-override
Add mpiexec override commands for Dragon's PBS+PALS launcher. This is used to add overrides for the mpiexec commands used to launch the network config tool and the deprecated cleanup processes. The command needs to launch one process per node, line buffer the output, and tag the output with the process rank with some unique identifying information (global rank, hostname, etc). The commands should be passed as a single string. The following special strings are necessary and will be automatically filled in at the time of use by Dragon:
{nnodes} = number of nodes
Examples
Set launcher mpiexec network config override for Cray-PALS:
$ dragon-config add --netconfig-mpiexec-override='mpiexec --np {nnodes} -ppn 1 -l --line-buffer'
Set launcher mpiexec network config override for OpenMPI 5.0.6:
$ dragon-config add --netconfig-mpiexec-override='mpiexec --np {nnodes} --map-by ppr:1:node --stream-buffering=1 --tag-output'
These commands are used by default when the dragon launcher detects PBS+PALS.
To avoid checks with the automatic wlm detection and utilize the overridden mpiexec commands, run dragon with the workload manager specified as '--wlm=pbs+pals'.
- --backend-mpiexec-override
Add mpiexec override commands for Dragon’s PBS+PALS launcher. This is used to add overrides for the mpiexec commands used to launch the backend processes. The command should be passed as a single string. The following special strings are necessary and will be automatically filled in at the time of use by Dragon:
{nnodes} = number of nodes, {nodelist} = comma separated list of nodes
Examples
Set launcher mpiexec backend launch override for Cray-PALS:
$ dragon-config add --backend-mpiexec-override='mpiexec --np {nnodes} --ppn 1 --cpu-bind none --hosts {nodelist} --line-buffer'
Set launcher mpiexec backend launch override for OpenMPI 5.0.6:
$ dragon-config add --backend-mpiexec-override='mpiexec --np {nnodes} --map-by ppr:1:node --stream-buffering=1 --tag-output --host {nodelist}'
These commands are used by default when the dragon launcher detects PBS+PALS.
To avoid checks with the automatic wlm detection and utilize the overridden mpiexec commands, run dragon with the workload manager specified as '--wlm=pbs+pals'.
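To check what an override string will expand to before storing it, the placeholders can be filled in by hand. The sketch below uses Python's str.format as a stand-in for whatever substitution Dragon performs, which is an assumption; the node names and the helper fill_override are illustrative.

```python
# Override templates taken from the Cray-PALS examples above.
NETCONFIG_TEMPLATE = "mpiexec --np {nnodes} -ppn 1 -l --line-buffer"
BACKEND_TEMPLATE = (
    "mpiexec --np {nnodes} --ppn 1 --cpu-bind none "
    "--hosts {nodelist} --line-buffer"
)


def fill_override(template, nodes):
    """Substitute {nnodes} and {nodelist} for a given list of node names."""
    return template.format(nnodes=len(nodes), nodelist=",".join(nodes))


nodes = ["nid0001", "nid0002"]
netconfig_cmd = fill_override(NETCONFIG_TEMPLATE, nodes)
backend_cmd = fill_override(BACKEND_TEMPLATE, nodes)
```

A template that misspells a placeholder raises KeyError in this sketch, which makes typos in the override string easy to spot before launch.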
- --tcp-runtime
Set this if only TCP is used for backend communication, in order to turn off the warning message printed during runtime initialization
Default:
False
test
Define paths necessary for executing tests of Dragon’s MPI application support
Examples
Set paths for headers and libraries for Cray MPICH, Open MPI, or ANL MPICH installations:
dragon-config test --cray-mpich=/opt/cray/pe/lmod/modulefiles/comnet/gnu/12.0/ofi/1.0/cray-mpich
dragon-config test --open-mpi=/lus/scratch/dragonhpc/openmpi
dragon-config test --anl-mpich=/lus/scratch/dragonhpc/mpich
dragon-config test [-h] [--cray-mpich CRAY_MPICH] [--open-mpi OPEN_MPI]
[--anl-mpich ANL_MPICH]
Named Arguments
- --cray-mpich
Path to Cray MPICH installation
- --open-mpi
Path to Open MPI installation
- --anl-mpich
Path to ANL MPICH installation
dragon-cleanup
The dragon-cleanup tool identifies and/or removes residual dragon runtime services and user applications from previous runs in either a single or multi-node environment. This is particularly useful when an execution of Dragon fails to exit cleanly, leaving behind orphaned processes or resources that could interfere with subsequent runs.
The tool automatically detects if a Workload Manager (WLM), such as Slurm or PBS, was used for node allocation. If a WLM is present, dragon-cleanup targets those active nodes. Nodes can also be specified manually via the --hostlist or --hostfile arguments, or by using the dhosts utility to set the DRAGON_RUN_NODEFILE environment variable.
To ensure efficiency across large multi-node environments, dragon-cleanup utilizes an SSH-tree to launch cleanup processes on each node. This requires that all nodes are configured for password-less SSH and maintain mutual routability.
Example usage:
- dragon-cleanup --hostlist host1,host2,host3
Manually specify that dragon-cleanup should run on host1, host2 and host3.
- dragon-cleanup --hostfile my_hostfile.txt --dry-run
Specify that dragon-cleanup should run in dry-run mode on the hosts specified in my_hostfile.txt. In dry-run mode, dragon-cleanup will print the processes and resources it would clean up, but won't actually make any changes.
- dragon-cleanup --wlm slurm
Force dragon-cleanup to look for an active Slurm allocation and run on the nodes from that allocation.
usage: dragon-cleanup [-h] [--wlm WORKLOAD_MANAGER]
[--hostlist HOST_LIST | --hostfile HOST_LIST] [-s | -m]
[--dry-run] [--resilient] [--timeout TIMEOUT]
[--only-be]
Named Arguments
- --wlm
Possible choices: slurm, pbs, ssh
Specify what workload manager is used. Currently supported WLMs are: slurm, pbs, ssh
- --hostlist
Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
Default:
[]
- --hostfile
Specify a file listing the hostnames to connect to via SSH launch, one hostname per line. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
Default:
[]
- -s, --single-node-override
Override automatic launcher selection to force use of the single node launcher
Default:
False
- -m, --multi-node-override
Override automatic launcher selection to force use of the multi-node launcher
Default:
False
- --dry-run
Dry run. Don't actually make changes
Default:
False
- --resilient
Prevent removing resources to enable resilient restart of the Dragon runtime
Default:
False
- --timeout
Time to wait when terminating a process before killing it.
Default:
2
- --only-be
Only tear down Dragon processes on backend compute nodes.
Default:
False
Default:
False
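The --timeout option above describes a common terminate-then-kill pattern. The sketch below shows that general pattern with Python's subprocess module; it is not dragon-cleanup's code, the helper stop_process is hypothetical, and treating the timeout as seconds is an assumption because the option text does not state a unit.

```python
import subprocess
import sys


def stop_process(proc, timeout=2):
    """Ask proc to exit, then kill it if it is still alive after timeout.

    Returns "terminated" if the polite request was enough, else "killed".
    """
    proc.terminate()                 # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()                  # forceful stop (SIGKILL on POSIX)
        proc.wait()
        return "killed"


# Start a child that would otherwise sleep for a minute, then stop it.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
outcome = stop_process(child, timeout=2)
```

A longer timeout gives processes more time to release resources on their own before they are killed.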
drun
The DragonRun (drun) utility is used to launch applications on a set of hosts.
The tool automatically detects if a Workload Manager (WLM), such as Slurm or PBS, was used for node allocation. If a WLM is present, drun targets those active nodes. Nodes can also be specified manually via the --hostlist or --hostfile arguments, or by using the dhosts utility to set the DRAGON_RUN_NODEFILE environment variable.
To ensure efficiency across large multi-node environments, drun utilizes an SSH-tree to launch processes on each node. This requires that all nodes are configured for password-less SSH and maintain mutual routability.
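The benefit of a tree-shaped launch can be seen by laying hosts out as a k-ary tree, where k plays the role of drun's --fanout value (default 16). The breadth-first layout, the choice of the first host as root, and the helper names below are assumptions for illustration; the actual tree drun builds is not specified here.

```python
def build_fanout_tree(hosts, fanout=16):
    """Map each host to its children in a breadth-first k-ary tree.

    hosts[0] is treated as the root; host i gets hosts
    i*fanout+1 .. i*fanout+fanout as children, if they exist.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    return {
        host: hosts[i * fanout + 1 : i * fanout + 1 + fanout]
        for i, host in enumerate(hosts)
    }


def tree_depth(n_hosts, fanout=16):
    """Number of hops from the root to the deepest host."""
    depth, reached, width = 0, 1, 1
    while reached < n_hosts:
        width *= fanout      # hosts on the next level
        reached += width
        depth += 1
    return depth


small_hosts = [f"host{i}" for i in range(7)]
small_tree = build_fanout_tree(small_hosts, fanout=2)
depth_1000 = tree_depth(1000, fanout=16)
```

With a fanout of 16, a thousand hosts are reached in three hops, whereas a flat launch from a single node would need a thousand direct SSH connections.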
Example usage:
- drun --hostlist host1,host2,host3 my_executable --option1 --option2
Manually specify a list of hosts, in this case, host1, host2 and host3, on which to run my_executable with options --option1 and --option2.
- drun --hostfile my_hostfile.txt my_executable
Specify a file containing a list of hosts, in this case, my_hostfile.txt, on which to run my_executable.
- drun --wlm slurm my_executable
Force drun to look for an active Slurm allocation and use the nodes from that allocation to run my_executable.
usage: drun [-h] [--wlm WORKLOAD_MANAGER]
[--hostlist HOST_LIST | --hostfile HOST_LIST] [-s | -m]
[--export {ALL,NONE}] [--env KEY=VALUE] [--include-fe]
[--fanout FANOUT] [-l LOG_LEVEL]
...
Positional Arguments
- USER_CMD
The executable, including any command line options, to execute on the remote nodes.
Default:
[]
Named Arguments
- --wlm
Possible choices: slurm, pbs, ssh
Specify what workload manager is used. Currently supported WLMs are: slurm, pbs, ssh
- --hostlist
Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
Default:
[]
- --hostfile
Specify a file listing the hostnames to connect to via SSH launch, one hostname per line. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
Default:
[]
- -s, --single-node-override
Override automatic launcher selection to force use of the single node launcher
Default:
False
- -m, --multi-node-override
Override automatic launcher selection to force use of the multi-node launcher
Default:
False
- --export
Possible choices: ALL, NONE
Identify which environment variables from the submission environment are propagated to the launched application.
Default:
'NONE'
- --env
Environment variables to set in the remote environment. Example: --env DEBUG=True
Default:
{}
- --include-fe
In addition to running the given command on the dragon backend node, also run the command on the dragon frontend.
Default:
False
- --fanout
DragonRun uses a fanout tree to efficiently communicate with its backend nodes. This value sets the number of children each node in this fanout tree talks to.
Default:
16
- -l, --log-level
Possible choices: CRITICAL, FATAL, ERROR, WARN, WARNING, INFO, DEBUG, NOTSET
Enables the output of diagnostic log messages. By default, DragonRun disables all diagnostic log messaging. Passing one of NOTSET, DEBUG, INFO, WARNING, ERROR, or CRITICAL to this option enables the specified log verbosity.
Default:
'NOTSET'
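The combined effect of drun's --export and --env options can be sketched as follows. The helper names are hypothetical, and two points are assumptions rather than documented behavior: that --env may be given more than once, and that a value set with --env takes precedence over one exported from the submission environment.

```python
def parse_env_options(pairs):
    """Turn --env KEY=VALUE strings into a dict."""
    env = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        env[key] = value
    return env


def remote_environment(local_env, export="NONE", env_pairs=()):
    """Model the environment a remote command would see.

    export="ALL" starts from a copy of the local environment,
    export="NONE" starts empty; --env pairs are applied on top.
    """
    if export not in ("ALL", "NONE"):
        raise ValueError("export must be 'ALL' or 'NONE'")
    base = dict(local_env) if export == "ALL" else {}
    base.update(parse_env_options(env_pairs))
    return base


local = {"HOME": "/home/user", "DEBUG": "False"}
env_none = remote_environment(local, "NONE", ["DEBUG=True"])
env_all = remote_environment(local, "ALL", ["DEBUG=True"])
```

With the default of NONE, only the variables named via --env reach the remote command in this model.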
dhosts
The dhosts utility defines the list of hosts that should be used by other Dragon runtime tools. To do this, dhosts generates a temporary hostfile and exports the DRAGON_RUN_NODEFILE environment variable within a subshell. To generate the host list, dhosts first attempts to detect an active Workload Manager (WLM) allocation, such as from Slurm or PBS. If no WLM is present, or if dhosts is unable to detect the allocated nodes from the WLM, the list of hosts can be specified manually via the --hostlist or --hostfile options.
This is useful for running other dragon tools on a specific set of hosts without having to specify the list of hosts to each tool individually. Since dhosts exports the DRAGON_RUN_NODEFILE environment variable, any tool that relies on this environment variable can automatically use the generated hostlist. For example, dragon-cleanup will automatically use the hostlist generated by dhosts if DRAGON_RUN_NODEFILE is set in the environment.
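A tool of your own could pick up the same host list by reading the file named in DRAGON_RUN_NODEFILE. The sketch below assumes the generated hostfile holds one hostname per line, like the --hostfile format; the function name hosts_from_environment and the sample hosts are illustrative.

```python
import os
import tempfile


def hosts_from_environment(environ=os.environ):
    """Return the hosts listed in $DRAGON_RUN_NODEFILE, or [] if unset."""
    path = environ.get("DRAGON_RUN_NODEFILE")
    if not path:
        return []
    with open(path) as fh:
        # Skip blank lines and strip surrounding whitespace.
        return [line.strip() for line in fh if line.strip()]


# Simulate what a dhosts subshell might provide, using a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".hosts", delete=False) as fh:
    fh.write("host1\nhost2\nhost3\n")
    nodefile = fh.name
try:
    found = hosts_from_environment({"DRAGON_RUN_NODEFILE": nodefile})
finally:
    os.unlink(nodefile)

missing = hosts_from_environment({})
```

Passing the environment as a parameter keeps the helper easy to test; called with no argument it reads the real process environment.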
To ensure efficiency across large multi-node environments, the DragonRun (drun) launcher utilizes an SSH-tree to launch processes on each node. This requires that all nodes are configured for password-less SSH and maintain mutual routability.
Example usage:
- dhosts --hostlist host1,host2,host3
Manually specify a list of hosts, in this case, host1, host2 and host3.
- dhosts --hostfile my_hostfile.txt
Specify a file containing a list of hosts, in this case, my_hostfile.txt.
- dhosts --wlm slurm
Force dhosts to look for an active Slurm allocation and use the nodes from that allocation.
- dhosts --list
Print the list of hosts that dhosts has determined should be used in the current environment.
usage: dhosts [-h] [--wlm WORKLOAD_MANAGER]
[--hostlist HOST_LIST | --hostfile HOST_LIST] [--list]
Named Arguments
- --wlm
Possible choices: slurm, pbs, ssh
Specify what workload manager is used. Currently supported WLMs are: slurm, pbs, ssh
- --hostlist
Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
Default:
[]
- --hostfile
Specify a file listing the hostnames to connect to via SSH launch, one hostname per line. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
Default:
[]
- --list
List known hosts in the current dragon run environment
Default:
False