Command Line Interface
Dragon provides two command line interfaces (CLIs) that let users interact with the Dragon runtime and its components.
- dragon: The dragon command is used to start the Dragon runtime services and user applications.
- dragon-config: The dragon-config command is used to set configuration options for the Dragon runtime.
dragon
Dragon Launcher Arguments and Options
usage: dragon [-h] [--hostlist HOSTLIST | --hostfile HOSTFILE]
[--network-prefix NETWORK_PREFIX]
[--network-config NETWORK_CONFIG] [--wlm WORKLOAD_MANAGER]
[-p PORT] [--overlay-port OVERLAY_PORT]
[--frontend-port FRONTEND_PORT] [--transport TRANSPORT_AGENT]
[-s | -m] [-l LOG_LEVEL] [-r] [-N NODE_COUNT] [-i IDLE_COUNT]
[-e] [-T TELEM_LEVEL] [-b] [--no-label] [--basic-label]
[--verbose-label] [--version]
[PROG] ...
Positional Arguments
- PROG
PROG specifies a program to be run on the primary compute node. The file may be either executable or not; if PROG is not executable, Python 3 is used to interpret it. All command-line arguments after PROG are passed to the program as its arguments. Both PROG and ARGS are optional.
- ARG
Zero or more program arguments may be specified.
Default:
[]
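The split between launcher options, PROG, and the program's own arguments can be sketched in Python. This is a minimal illustration that only assembles an argument list; the helper name build_dragon_argv, the file name train.py, and its --epochs flag are hypothetical, and nothing here starts the runtime.

```python
import shlex


def build_dragon_argv(prog, prog_args=(), launcher_opts=()):
    """Assemble an argv list for the dragon launcher (hypothetical helper).

    Launcher options come first. Everything after PROG is passed
    through to the program as its own arguments.
    """
    return ["dragon", *launcher_opts, prog, *prog_args]


# train.py and --epochs are made-up names used only for illustration.
argv = build_dragon_argv(
    "train.py",
    prog_args=["--epochs", "3"],
    launcher_opts=["-l", "INFO"],
)
command_line = shlex.join(argv)
print(command_line)
```

On a system where dragon is installed, the list form could be handed to subprocess.run(argv) instead of being printed.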
Named Arguments
- --hostlist
Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
- --hostfile
Specify a file listing the hostnames to connect to via SSH launch. The file should contain one hostname per line. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
- --network-prefix
NETWORK_PREFIX specifies the network prefix the Dragon runtime uses to determine which IP addresses to build multinode connections from. By default the regular expression r'^(hsn|ipogif|ib)\d+$' is used, which matches the prefixes of known HPE-Cray XC and EX high speed networks. If uncertain which networks are available, the following command prints them in readable form: dragon-network-ifaddrs --ip --no-loopback --up --running | jq. Prepending srun may be necessary to see the networks available on backend compute nodes.
- --network-config
NETWORK_CONFIG specifies a YAML or JSON file describing the available backend compute nodes. The file is normally generated by the launcher's network config tool (e.g. dragon-network-config --output-to-yaml), which describes the nodes specified by the workload manager. Alternatively, the file can be written manually, as is needed for an SSH-only launch. An example with keywords and formatting can be found in the documentation.
- --wlm, -w
Possible choices: slurm, pbs+pals, ssh, k8s, drun
Specify what workload manager is used. Currently supported WLMs are: slurm, pbs+pals, ssh, k8s, drun
- -p, --port
PORT specifies the port to be used for multinode communication. By default, 7575 is used.
- --overlay-port
OVERLAY_PORT specifies the port to be used for the dragon overlay network communication. By default, 6565 is used.
- --frontend-port
FRONTEND_PORT specifies the port to be used by the Overlay transport agent running on the Dragon frontend node. By default, 6566 is used.
- --transport, -t
Possible choices: hsta, tcp, configured
TRANSPORT_AGENT selects which transport agent will be used for backend node-to-node communication. By default, Dragon consults the files created by running dragon-config. Run dragon-config --help for more information. In the absence of dragon-config files, the TCP agent is used. Currently supported agents are: hsta, tcp
Default:
configured
- -s, --single-node-override
Override automatic launcher selection to force use of the single node launcher
Default:
False
- -m, --multi-node-override
Override automatic launcher selection to force use of the multi-node launcher
Default:
False
- -l, --log-level
Possible choices: NONE, DEBUG, INFO, WARNING, ERROR, CRITICAL, stderr=NONE, stderr=DEBUG, stderr=INFO, stderr=WARNING, stderr=ERROR, stderr=CRITICAL, dragon_file=NONE, dragon_file=DEBUG, dragon_file=INFO, dragon_file=WARNING, dragon_file=ERROR, dragon_file=CRITICAL, actor_file=NONE, actor_file=DEBUG, actor_file=INFO, actor_file=WARNING, actor_file=ERROR, actor_file=CRITICAL
The Dragon runtime can write diagnostic log messages to several output devices. Diagnostic log messages can be seen on the Dragon stderr console, in a combined 'dragon_*.log' file, or in individual log files created by each of the Dragon 'actors' (Global Services, Local Services, etc).
By default, the Dragon runtime disables all diagnostic log messaging.
Passing one of NONE, DEBUG, INFO, WARNING, ERROR, or CRITICAL to this option enables the specified log verbosity. When DEBUG level logging is enabled, the Dragon runtime limits the stderr output and the combined dragon log file to INFO level messages, while each actor's log file contains the complete log history, including DEBUG messages. This limits the number of messages sent between the Dragon frontend and the Dragon backend at scale.
To override the default logging behavior and enable specific logging to one or more Dragon output devices, the LOG_LEVEL option can be formatted as a keyword=value pair, where the keyword is one of the Dragon log output devices (stderr, dragon_file or actor_file) and the value is one of NONE, DEBUG, INFO, WARNING, ERROR or CRITICAL (e.g. -l dragon_file=INFO -l actor_file=DEBUG). Multiple -l or --log-level options may be passed to enable the logging desired.
Default:
{'DRAGON_LOG_DEVICE_STDERR': 'NONE', 'DRAGON_LOG_DEVICE_DRAGON_FILE': 'NONE', 'DRAGON_LOG_DEVICE_ACTOR_FILE': 'NONE'}
- -r, --resilient
If used, the Dragon runtime will attempt to continue execution of the user app in the event of a hardware or user software error by falling back to functional hardware resources and omitting hardware where the given error occurred.
Default:
False
- -N, --nodes
NODE_COUNT specifies the number of nodes to use. NODE_COUNT must be less than or equal to the number of available nodes within the WLM allocation. A value of zero (0) indicates that all available nodes should be used (the default).
- -i, --idle
In conjunction with the --resilient flag, IDLE_COUNT specifies the number of nodes that will be held in reserve when the user application is run. If a node executing the user application experiences an error, the Dragon runtime will pull an "idle" node into the compute pool and begin executing the user application on it.
- -e, --exhaust-resources
When used with --resilient execution, the Dragon runtime will continue executing the user application in the event of any number of localized hardware errors until there are 0 nodes available for computation. If not used, the default behavior is to keep executing until the number of available nodes is less than the number requested via the --nodes argument.
Default:
False
- -T, --telemetry-level
The Dragon runtime supports native and user-defined telemetry. By default, the Dragon runtime disables all telemetry. Passing one of 1, 2, 3, 4, or 5 to this option enables the specified telemetry verbosity.
Default:
0
- -b, --progress-bar
Enables a progress bar that shows HSTA request completions against the total number of expected request completions for the current launch configuration, which is defined by the values in sys.argv and the number of nodes used for the launch. The first run with a given configuration only collects the information needed for the progress bar. Subsequent runs display the application's progress via the progress bar. Data collected during the first run is stored in a file in a hidden .dragon directory in the current working directory from which the application was launched. This feature currently requires the use of a parallel file system such as Lustre or NFS.
Default:
False
- --no-label
Default:
True
- --basic-label
Default:
False
- --verbose-label
Default:
False
- --version
show program’s version number and exit
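The --log-level rules described above can be worked through in a short Python sketch. It follows one reading of the text (a bare level applies to all three devices, a bare DEBUG is capped at INFO for stderr and the combined file, and later options override earlier ones); the function name resolve_log_levels is hypothetical and this is not Dragon's actual parser.

```python
LEVELS = ("NONE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")

# Keyword accepted on the command line -> key shown in the documented default.
DEVICES = {
    "stderr": "DRAGON_LOG_DEVICE_STDERR",
    "dragon_file": "DRAGON_LOG_DEVICE_DRAGON_FILE",
    "actor_file": "DRAGON_LOG_DEVICE_ACTOR_FILE",
}


def resolve_log_levels(values):
    """Map a list of -l/--log-level values to a per-device level dict.

    Illustrative only. Assumes options are applied in order, so a later
    value overrides an earlier one for the same device.
    """
    result = {key: "NONE" for key in DEVICES.values()}
    for value in values:
        if "=" in value:
            # keyword=value form: set one device explicitly.
            device, level = value.split("=", 1)
            if device not in DEVICES or level not in LEVELS:
                raise ValueError(f"invalid log level option: {value!r}")
            result[DEVICES[device]] = level
            continue
        if value not in LEVELS:
            raise ValueError(f"invalid log level option: {value!r}")
        for device, key in DEVICES.items():
            # Bare DEBUG: only the per-actor files keep DEBUG messages.
            if value == "DEBUG" and device != "actor_file":
                result[key] = "INFO"
            else:
                result[key] = value
    return result


print(resolve_log_levels(["DEBUG"]))
print(resolve_log_levels(["dragon_file=INFO", "actor_file=DEBUG"]))
```

With no -l option the sketch returns the documented default, in which every device is set to NONE.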
dragon-config
Configure the build and runtime environments for Dragon with regard to third-party libraries. This is needed for building network backends for HSTA, as well as for GPU support more generally. In future releases, this script may also be used for runtime configuration of libraries. Additionally, some options provide information about the Dragon installation so that Dragon header files and libraries can be used in compiled applications.
usage: dragon-config [-h] [-c] [--config-file CONFIG_FILE] [-s] [-g GET]
[-l | -o | -e]
{add,test} ...
Named Arguments
- -c, --clean
Clean out all config information.
Default:
False
- --config-file
Point configuration to a custom config file. Largely intended for testing
- -s, --serialize
Serialize all key-value pairs currently in the configuration file into a single, colon-separated string that can be passed to the --add command.
Default:
False
- -g, --get
Get the value for the given key that can be passed to the --add or --add-mpiexec command.
- -l, --linker-options
For execution during linking, print the linker options for building applications against the Dragon C/C++ API
Default:
False
- -o, --compiler-options
For execution during compilation, print the compiler options for building applications against the Dragon C/C++ API
Default:
False
- -e, --explicit-compiler-options
Print the compilation and link options for building C programs with Dragon, each with a brief description, and exit
Default:
False
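As a rough sketch of how the -o and -l options might be used in a build script, the Python below combines compiler and linker option strings into one compile command. It assumes dragon-config prints each set of options to stdout as a single shell-style string, which the text above does not confirm; the helper names, the gcc default, and the placeholder paths and library name are all hypothetical.

```python
import shlex
import subprocess


def dragon_config_output(flag):
    """Run dragon-config with one flag and return its stripped stdout.

    Assumes the options are printed to stdout. Requires dragon-config
    to be on PATH, so it is not called in this example.
    """
    done = subprocess.run(
        ["dragon-config", flag], capture_output=True, text=True, check=True
    )
    return done.stdout.strip()


def build_compile_command(source, output, compiler_opts, linker_opts, cc="gcc"):
    """Combine compiler and linker option strings into one argv list."""
    return [
        cc,
        *shlex.split(compiler_opts),
        source,
        "-o",
        output,
        *shlex.split(linker_opts),
    ]


# Placeholder strings, NOT real dragon-config output. On a real install
# they would come from dragon_config_output("-o") and ("-l").
example = build_compile_command(
    "app.c",
    "app",
    "-I/path/to/dragon/include",
    "-L/path/to/dragon/lib -ldragon",
)
print(shlex.join(example))
```

Keeping the query and the command assembly separate means the assembly step can be checked without a Dragon installation.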
Add and test paths subparser
- add
Possible choices: add, test
Add paths for configuration, compilation, execution, and testing of Dragon
Sub-commands
add
Define a number of paths (key=value) to configure include and library paths for Dragon, or to make the TCP runtime the always-on default for backend communication (set to True).
Examples
UCX backend:
dragon-config add --ucx-include=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/include
dragon-config add --ucx-build-lib=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/lib
dragon-config add --ucx-runtime-lib=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/lib
Set TCP transport as always-on default backend:
dragon-config add --tcp-runtime
Set PMIx header files location to enable PMIx support for MPI applications. Specifically looking for path <pmix include>/src/include/pmix_globals.h:
dragon-config add --pmix-include=/usr/include:/usr/include/pmix
dragon-config add [-h] [--ofi-include OFI_INCLUDE] [--ucx-include UCX_INCLUDE]
[--pmix-include PMIX_INCLUDE] [--mpi-include MPI_INCLUDE]
[--cuda-include CUDA_INCLUDE] [--hip-include HIP_INCLUDE]
[--ze-include ZE_INCLUDE] [--ofi-build-lib OFI_BUILD_LIB]
[--ucx-build-lib UCX_BUILD_LIB]
[--ofi-runtime-lib OFI_RUNTIME_LIB]
[--ucx-runtime-lib UCX_RUNTIME_LIB]
[--cuda-runtime-lib CUDA_RUNTIME_LIB]
[--netconfig-mpiexec-override NETCONFIG_MPIEXEC_OVERRIDE]
[--backend-mpiexec-override BACKEND_MPIEXEC_OVERRIDE]
[--tcp-runtime]
Named Arguments
- --ofi-include
Include path for OFI headers to be used when building dragon
- --ucx-include
Include path for UCX headers to be used when building dragon
- --pmix-include
Include path for PMIx headers to be used when building dragon
- --mpi-include
Include path for MPI headers to be used when building dragon
- --cuda-include
Include path for CUDA headers to be used when building dragon
- --hip-include
Include path for HIP headers to be used when building dragon
- --ze-include
Include path for Ze headers to be used when building dragon
- --ofi-build-lib
Path to OFI libraries (eg: libfabric.so) to be used when building dragon
- --ucx-build-lib
Path to UCX libraries (eg: libucp.so) to be used when building dragon
- --ofi-runtime-lib
Path to OFI libraries (eg: libfabric.so) to be used during app execution
- --ucx-runtime-lib
Path to UCX libraries (eg: libucp.so) to be used during app execution
- --cuda-runtime-lib
Path to CUDA libraries (eg: libcudart.so) to be used during app execution
- --netconfig-mpiexec-override
Add mpiexec override commands for Dragon's PBS+PALS launcher. This is used to add overrides for the mpiexec commands used to launch the network config tool and the deprecated cleanup processes. The command needs to launch one process per node, line buffer the output, and tag the output with the process rank along with some unique identifying information (global rank, hostname, etc). The commands should be passed as a single string. The following special strings are necessary and will be filled in automatically by Dragon at the time of use:
{nnodes} = number of nodes
Examples
Set launcher mpiexec network config override for Cray-PALS:
$ dragon-config add --netconfig-mpiexec-override='mpiexec --np {nnodes} -ppn 1 -l --line-buffer'
Set launcher mpiexec network config override for OpenMPI 5.0.6:
$ dragon-config add --netconfig-mpiexec-override='mpiexec --np {nnodes} --map-by ppr:1:node --stream-buffering=1 --tag-output'
These commands are used by default when the dragon launcher detects PBS+PALS.
To skip the automatic WLM detection checks and use the overridden mpiexec commands, run dragon with the workload manager specified as '--wlm=pbs+pals'.
- --backend-mpiexec-override
Add mpiexec override commands for Dragon’s PBS+PALS launcher. This is used to add overrides for the mpiexec commands used to launch the backend processes. The command should be passed as a single string. The following special strings are necessary and will be automatically filled in at the time of use by Dragon:
{nnodes} = number of nodes, {nodelist} = comma-separated list of nodes
Examples
Set launcher mpiexec backend launch override for Cray-PALS:
$ dragon-config add --backend-mpiexec-override='mpiexec --np {nnodes} --ppn 1 --cpu-bind none --hosts {nodelist} --line-buffer'
Set launcher mpiexec backend launch override for OpenMPI 5.0.6:
$ dragon-config add --backend-mpiexec-override='mpiexec --np {nnodes} --map-by ppr:1:node --stream-buffering=1 --tag-output --host {nodelist}'
These commands are used by default when the dragon launcher detects PBS+PALS.
To skip the automatic WLM detection checks and use the overridden mpiexec commands, run dragon with the workload manager specified as '--wlm=pbs+pals'.
- --tcp-runtime
Set this if only TCP is used for backend communication, to turn off the warning message during runtime initialization
Default:
False
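To make the placeholder strings in the mpiexec overrides concrete, the Python sketch below expands {nnodes} and {nodelist} for a given set of nodes, using the placeholder names that appear in the examples above. It only illustrates what the filled-in command would look like; the function name and node names are hypothetical, and this is not how Dragon itself performs the substitution.

```python
def fill_mpiexec_template(template, nodes):
    """Expand {nnodes} and {nodelist} in an mpiexec override string.

    Illustrative only. A template that omits {nodelist} is fine,
    because str.format ignores unused keyword arguments.
    """
    return template.format(nnodes=len(nodes), nodelist=",".join(nodes))


# Backend launch override from the Cray-PALS example above.
backend_template = (
    "mpiexec --np {nnodes} --ppn 1 --cpu-bind none "
    "--hosts {nodelist} --line-buffer"
)

# node-1 and node-2 are made-up hostnames.
filled = fill_mpiexec_template(backend_template, ["node-1", "node-2"])
print(filled)
```

For the two made-up hosts this prints mpiexec --np 2 --ppn 1 --cpu-bind none --hosts node-1,node-2 --line-buffer.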
test
Define paths necessary for executing tests of Dragon’s MPI application support
Examples
Set paths for headers and libraries for Cray MPICH, Open MPI, or ANL MPICH installations:
dragon-config test --cray-mpich=/opt/cray/pe/lmod/modulefiles/comnet/gnu/12.0/ofi/1.0/cray-mpich
dragon-config test --open-mpi=/lus/scratch/dragonhpc/openmpi
dragon-config test --anl-mpich=/lus/scratch/dragonhpc/mpich
dragon-config test [-h] [--cray-mpich CRAY_MPICH] [--open-mpi OPEN_MPI]
[--anl-mpich ANL_MPICH]
Named Arguments
- --cray-mpich
Path to Cray MPICH installation
- --open-mpi
Path to Open MPI installation
- --anl-mpich
Path to ANL MPICH installation