Command Line Interface

Dragon provides two command line interfaces (CLIs) that allow users to interact with the Dragon runtime and its components.

dragon - The dragon command is used to start the Dragon runtime services and user applications.
dragon-config - The dragon-config command is used to set configuration options for the Dragon runtime.

dragon

Dragon Launcher Arguments and Options

usage: dragon [-h] [--hostlist HOSTLIST | --hostfile HOSTFILE]
              [--network-prefix NETWORK_PREFIX]
              [--network-config NETWORK_CONFIG] [--wlm WORKLOAD_MANAGER]
              [-p PORT] [--overlay-port OVERLAY_PORT]
              [--frontend-port FRONTEND_PORT] [--transport TRANSPORT_AGENT]
              [-s | -m] [-l LOG_LEVEL] [-r] [-N NODE_COUNT] [-i IDLE_COUNT]
              [-e] [-T TELEM_LEVEL] [-b] [--no-label] [--basic-label]
              [--verbose-label] [--version]
              [PROG] ...

Positional Arguments

PROG

PROG specifies the program to be run on the primary compute node. The file need not be executable: if PROG is not executable, Python version 3 is used to interpret it. All command-line arguments after PROG are passed to the program as its arguments. Both PROG and its arguments are optional.

ARG

Zero or more program arguments may be specified.

Default: []

Named Arguments

--hostlist

Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.

--hostfile

Specify a file listing the hostnames to connect to via SSH launch. The file should contain one hostname per line. Either --hostfile or --hostlist is required when the workload manager is SSH, and both are used only for SSH launches.
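As a rough sketch of how a hostfile and an SSH launch fit together, the following Python snippet writes a newline-separated hostfile and assembles a dragon command line from the options described above. The helper names, the host names, and the program my_app.py with its arguments are hypothetical; the snippet only builds the argument list and does not start the runtime.

```python
import tempfile
from pathlib import Path


def write_hostfile(path, hostnames):
    """Write one hostname per line, the layout described for --hostfile."""
    path = Path(path)
    path.write_text("\n".join(hostnames) + "\n")
    return path


def ssh_launch_command(hostfile, prog, *prog_args):
    """Assemble (but do not run) a dragon command line for an SSH launch."""
    return ["dragon", "--wlm", "ssh", "--hostfile", str(hostfile), prog, *prog_args]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical host names and program, for illustration only.
    hostfile = write_hostfile(Path(tmp) / "hosts.txt", ["host_1", "host_2", "host_3"])
    hostfile_text = hostfile.read_text()
    command = ssh_launch_command(hostfile, "my_app.py", "--steps", "10")

print(hostfile_text, end="")
print(command)
```

Everything after the program name is passed through to the program, so the hypothetical --steps 10 pair above would reach my_app.py rather than the launcher.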

--network-prefix

NETWORK_PREFIX specifies the network prefix the Dragon runtime uses to determine which IP addresses to build multinode connections from. By default the regular expression r'^(hsn|ipogif|ib)\d+$' is used, which matches the prefixes of known HPE-Cray XC and EX high speed networks. If uncertain which networks are available, the following command returns them in pretty formatting: dragon-network-ifaddrs --ip --no-loopback --up --running | jq. Prepending srun may be necessary to list the networks available on backend compute nodes.
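To see which interface names a prefix pattern would select, the expression can be tried out with Python's re module. This is a minimal sketch, assuming the default pattern is r'^(hsn|ipogif|ib)\d+$'; the helper function and the interface names are hypothetical, and the snippet does not reproduce how Dragon itself applies the pattern.

```python
import re

# Default pattern quoted in the --network-prefix description (assumed form).
DEFAULT_NETWORK_PREFIX = r"^(hsn|ipogif|ib)\d+$"


def matching_interfaces(names, pattern=DEFAULT_NETWORK_PREFIX):
    """Return the interface names that match the prefix pattern."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


# Hypothetical interface names, for illustration only.
candidates = ["lo", "eth0", "hsn0", "hsn1", "ipogif0", "ib0", "ibx"]
print(matching_interfaces(candidates))
# ['hsn0', 'hsn1', 'ipogif0', 'ib0']
```

A cluster whose fast network shows up under another name would need a different expression, for example r'^eth\d+$' for interfaces named eth0, eth1, and so on.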

--network-config

NETWORK_CONFIG specifies a YAML or JSON file describing the available backend compute nodes. The file is normally generated by the launcher's network config tool (e.g. dragon-network-config --output-to-yaml), which describes the nodes provided by a workload manager. Alternatively, the file can be written manually, as is needed for an SSH-only launch. An example with keywords and formatting can be found in the documentation.

--wlm, -w

Possible choices: slurm, pbs+pals, ssh, k8s, drun

Specify what workload manager is used. Currently supported WLMs are: slurm, pbs+pals, ssh, k8s, drun

-p, --port

PORT specifies the port to be used for multinode communication. By default, 7575 is used.

--overlay-port

OVERLAY_PORT specifies the port to be used for the dragon overlay network communication. By default, 6565 is used.

--frontend-port

FRONTEND_PORT specifies the port to be used by the Overlay transport agent running on the Dragon frontend node. By default, 6566 is used.

--transport, -t

Possible choices: hsta, tcp, configured

TRANSPORT_AGENT selects which transport agent will be used for backend node-to-node communication. By default, Dragon consults the files created by running dragon-config. Run dragon-config --help for more information. In the absence of dragon-config files, the TCP agent will be used. Currently supported agents are: hsta, tcp

Default: configured

-s, --single-node-override

Override automatic launcher selection to force use of the single node launcher

Default: False

-m, --multi-node-override

Override automatic launcher selection to force use of the multi-node launcher

Default: False

-l, --log-level

Possible choices: NONE, DEBUG, INFO, WARNING, ERROR, CRITICAL, stderr=NONE, stderr=DEBUG, stderr=INFO, stderr=WARNING, stderr=ERROR, stderr=CRITICAL, dragon_file=NONE, dragon_file=DEBUG, dragon_file=INFO, dragon_file=WARNING, dragon_file=ERROR, dragon_file=CRITICAL, actor_file=NONE, actor_file=DEBUG, actor_file=INFO, actor_file=WARNING, actor_file=ERROR, actor_file=CRITICAL

The Dragon runtime can send diagnostic log messages to several output devices. Diagnostic log messages can be seen on the Dragon stderr console, via a combined ‘dragon_*.log’ file, or via individual log files created by each of the Dragon ‘actors’ (Global Services, Local Services, etc).

By default, the Dragon runtime disables all diagnostic log messaging.

When one of NONE, DEBUG, INFO, WARNING, ERROR, or CRITICAL is passed to this option, the Dragon runtime enables the specified log verbosity. When DEBUG level logging is enabled, the Dragon runtime limits the stderr and combined dragon log file to INFO level messages. Each actor’s log file will contain the complete log history, including DEBUG messages. This is done to help limit the number of messages sent between the Dragon frontend and the Dragon backend at scale.

To override the default logging behavior and enable specific logging to one or more Dragon output devices, the LOG_LEVEL option can be formatted as a keyword=value pair, where the keyword is one of the Dragon log output devices (stderr, dragon_file or actor_file) and the value is one of NONE, DEBUG, INFO, WARNING, ERROR or CRITICAL (e.g. -l dragon_file=INFO -l actor_file=DEBUG). Multiple -l|--log-level options may be passed to enable the logging desired.
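The rules above can be summarized as a mapping from the -l values to a level per output device. The following Python sketch is one reading of those rules and not Dragon's actual implementation: the function name is hypothetical, and it assumes that a bare level applies to all three devices, with DEBUG capped at INFO for stderr and the combined file.

```python
DEVICES = ("stderr", "dragon_file", "actor_file")
LEVELS = ("NONE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def resolve_log_levels(options):
    """Map a sequence of -l values to one level per output device."""
    levels = {device: "NONE" for device in DEVICES}  # logging is off by default
    for option in options:
        if "=" in option:
            # keyword=value form: set a single device explicitly.
            device, level = option.split("=", 1)
            if device not in DEVICES or level not in LEVELS:
                raise ValueError(f"invalid log option: {option}")
            levels[device] = level
        elif option == "DEBUG":
            # Bare DEBUG: only the per-actor files keep DEBUG messages.
            levels.update(stderr="INFO", dragon_file="INFO", actor_file="DEBUG")
        elif option in LEVELS:
            # Assumption: any other bare level applies to every device.
            levels = {device: option for device in DEVICES}
        else:
            raise ValueError(f"invalid log option: {option}")
    return levels


print(resolve_log_levels(["DEBUG"]))
# {'stderr': 'INFO', 'dragon_file': 'INFO', 'actor_file': 'DEBUG'}
print(resolve_log_levels(["dragon_file=INFO", "actor_file=DEBUG"]))
# {'stderr': 'NONE', 'dragon_file': 'INFO', 'actor_file': 'DEBUG'}
```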

Default: {'DRAGON_LOG_DEVICE_STDERR': 'NONE', 'DRAGON_LOG_DEVICE_DRAGON_FILE': 'NONE', 'DRAGON_LOG_DEVICE_ACTOR_FILE': 'NONE'}

-r, --resilient

If used, the Dragon runtime will attempt to continue execution of the user app in the event of a hardware or user software error by falling back to functional hardware resources and omitting hardware where the given error occurred.

Default: False

-N, --nodes

NODE_COUNT specifies the number of nodes to use. NODE_COUNT must be less than or equal to the number of available nodes within the WLM allocation. A value of zero (0) indicates that all available nodes should be used (the default).

-i, --idle

In conjunction with the --resilient flag, IDLE_COUNT specifies the number of nodes that will be held in reserve when the user application is run. In the event that a node executing the user application experiences an error, the Dragon runtime will pull an “idle” node into the compute pool and begin executing the user application on it.

-e, --exhaust-resources

When used with --resilient execution, the Dragon runtime will continue executing the user application in the event of any number of localized hardware errors until there are 0 nodes available for computation. If not used, the default behavior is to continue executing until the number of nodes available is less than the number requested via the --nodes argument.
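The stopping rule described for resilient execution can be written as a small predicate. This is a sketch of the rule as stated here and not code from the Dragon runtime; the function and parameter names are hypothetical, and requested_nodes is assumed to be a positive count, not the special value 0.

```python
def should_continue(available_nodes, requested_nodes, exhaust_resources=False):
    """Return True if resilient execution would keep going after a failure."""
    if exhaust_resources:
        # With --exhaust-resources: run until no nodes are left.
        return available_nodes > 0
    # Default: stop once fewer nodes remain than were requested.
    return available_nodes >= requested_nodes


# 4 nodes requested, 3 still healthy after a failure.
print(should_continue(3, 4))                          # False
print(should_continue(3, 4, exhaust_resources=True))  # True
```

Nodes held back with --idle would raise the available count after a failure, which is why a reserve lets the default rule keep the application running.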

Default: False

-T, --telemetry-level

The Dragon runtime enables native and user-defined telemetry. By default, the Dragon runtime disables all telemetry. When one of 1, 2, 3, 4, or 5 is passed to this option, the Dragon runtime enables the specified telemetry verbosity.

Default: 0

-b, --progress-bar

Enables a progress bar for HSTA request completions vs. the total number of expected request completions for the current launch configuration, which is defined using the values in sys.argv and the number of nodes used for the launch. The first run with this configuration simply collects the necessary information to use a progress bar. Subsequent runs will display the application’s progress via the progress bar. Data collected during the first run will be stored in a file contained in a hidden .dragon directory in the current working directory from which the application was launched. This feature currently requires the use of a parallel file system such as Lustre or NFS.

Default: False

--no-label

Default: True

--basic-label

Default: False

--verbose-label

Default: False

--version

show program’s version number and exit

dragon-config

Configure the build and runtime environments for Dragon with regard to third-party libraries. This is needed for building network backends for HSTA, as well as for GPU support more generally. In future releases, this script may also be used for runtime configuration of libraries. Additionally, some options provide information about the Dragon installation to allow Dragon header files and libraries to be used in compiled applications.

usage: dragon-config [-h] [-c] [--config-file CONFIG_FILE] [-s] [-g GET]
                     [-l | -o | -e]
                     {add,test} ...

Named Arguments

-c, --clean

Clean out all config information.

Default: False

--config-file

Point configuration to a custom config file. Largely intended for testing

-s, --serialize

Serialize all key-value pairs currently in the configuration file into a single, colon-separated string that can be passed to the --add command.

Default: False

-g, --get

Get the value for the given key that can be passed to the --add or --add-mpiexec command.

-l, --linker-options

For execution during linking, print the linker options for building applications against the Dragon C/C++ API

Default: False

-o, --compiler-options

For execution during compilation, print the compiler options for building applications against the Dragon C/C++ API

Default: False

-e, --explicit-compiler-options

Print the compilation and link options for building C programs with Dragon, with a brief description of each, and exit

Default: False

Add and test paths subparser

add

Possible choices: add, test

Add paths for configuration, compilation, execution, and testing of Dragon

Sub-commands

add

Define a number of paths (key=value) to configure include and library paths for Dragon, or to make the TCP runtime the always-on default for backend communication (set to True).

Examples

UCX backend:
dragon-config add --ucx-include=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/include
dragon-config add --ucx-build-lib=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/lib
dragon-config add --ucx-runtime-lib=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/lib

Set TCP transport as always-on default backend:
dragon-config add --tcp-runtime

Set the PMIx header files location to enable PMIx support for MPI applications. Dragon specifically looks for the path <pmix include>/src/include/pmix_globals.h:
dragon-config add --pmix-include=/usr/include:/usr/include/pmix
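Before setting the PMIx include path, it can be useful to check that a candidate value actually contains the header mentioned above. The following Python sketch does that check. It assumes, based only on the example value /usr/include:/usr/include/pmix, that the option accepts a colon-separated list of directories; the function name and the temporary directory layout are hypothetical.

```python
import tempfile
from pathlib import Path


def find_pmix_globals(pmix_include):
    """Return the first <dir>/src/include/pmix_globals.h that exists, else None."""
    for entry in pmix_include.split(":"):
        candidate = Path(entry) / "src" / "include" / "pmix_globals.h"
        if candidate.is_file():
            return candidate
    return None


# Build a throwaway directory tree so the check can be demonstrated anywhere.
with tempfile.TemporaryDirectory() as tmp:
    header = Path(tmp) / "pmix" / "src" / "include" / "pmix_globals.h"
    header.parent.mkdir(parents=True)
    header.touch()
    found = find_pmix_globals(f"{tmp}/missing:{tmp}/pmix")
    found_ok = found == header

print(found_ok)
```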

dragon-config add [-h] [--ofi-include OFI_INCLUDE] [--ucx-include UCX_INCLUDE]
                  [--pmix-include PMIX_INCLUDE] [--mpi-include MPI_INCLUDE]
                  [--cuda-include CUDA_INCLUDE] [--hip-include HIP_INCLUDE]
                  [--ze-include ZE_INCLUDE] [--ofi-build-lib OFI_BUILD_LIB]
                  [--ucx-build-lib UCX_BUILD_LIB]
                  [--ofi-runtime-lib OFI_RUNTIME_LIB]
                  [--ucx-runtime-lib UCX_RUNTIME_LIB]
                  [--cuda-runtime-lib CUDA_RUNTIME_LIB]
                  [--netconfig-mpiexec-override NETCONFIG_MPIEXEC_OVERRIDE]
                  [--backend-mpiexec-override BACKEND_MPIEXEC_OVERRIDE]
                  [--tcp-runtime]

Named Arguments

--ofi-include

Include path for OFI headers to be used when building dragon

--ucx-include

Include path for UCX headers to be used when building dragon

--pmix-include

Include path for PMIx headers to be used when building dragon

--mpi-include

Include path for MPI headers to be used when building dragon

--cuda-include

Include path for CUDA headers to be used when building dragon

--hip-include

Include path for HIP headers to be used when building dragon

--ze-include

Include path for Ze headers to be used when building dragon

--ofi-build-lib

Path to OFI libraries (eg: libfabric.so) to be used when building dragon

--ucx-build-lib

Path to UCX libraries (eg: libucp.so) to be used when building dragon

--ofi-runtime-lib

Path to OFI libraries (eg: libfabric.so) to be used during app execution

--ucx-runtime-lib

Path to UCX libraries (eg: libucp.so) to be used during app execution

--cuda-runtime-lib

Path to CUDA libraries (eg: libcudart.so) to be used during app execution

--netconfig-mpiexec-override

Add mpiexec override commands for Dragon’s PBS+PALS launcher. This is used to add overrides for the mpiexec commands used to launch the network config tool and the deprecated cleanup processes. The command needs to launch one process per node, line buffer the output, and tag the output with the process rank with some unique identifying information (global rank, hostname, etc). The commands should be passed as a single string. The following special strings are necessary and will be automatically filled in at the time of use by Dragon:

{nnodes} = number of nodes

Examples

Set launcher mpiexec network config override for Cray-PALS:
$ dragon-config add --netconfig-mpiexec-override='mpiexec --np {nnodes} -ppn 1 -l --line-buffer'

Set launcher mpiexec network config override for OpenMPI 5.0.6:
$ dragon-config add --netconfig-mpiexec-override='mpiexec --np {nnodes} --map-by ppr:1:node --stream-buffering=1 --tag-output'

These commands are used by default when the dragon launcher detects PBS+PALS.

To avoid checks with the automatic WLM detection and use the overridden mpiexec commands, run dragon with the workload manager specified as ‘--wlm=pbs+pals’.

--backend-mpiexec-override

Add mpiexec override commands for Dragon’s PBS+PALS launcher. This is used to add overrides for the mpiexec commands used to launch the backend processes. The command should be passed as a single string. The following special strings are necessary and will be automatically filled in at the time of use by Dragon:

{nnodes} = number of nodes, {nodelist} = comma-separated list of nodes

Examples

Set launcher mpiexec backend launch override for Cray-PALS:
$ dragon-config add --backend-mpiexec-override='mpiexec --np {nnodes} --ppn 1 --cpu-bind none --hosts {nodelist} --line-buffer'

Set launcher mpiexec backend launch override for OpenMPI 5.0.6:
$ dragon-config add --backend-mpiexec-override='mpiexec --np {nnodes} --map-by ppr:1:node --stream-buffering=1 --tag-output --host {nodelist}'

These commands are used by default when the dragon launcher detects PBS+PALS.

To avoid checks with the automatic WLM detection and use the overridden mpiexec commands, run dragon with the workload manager specified as ‘--wlm=pbs+pals’.
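To make the placeholder mechanism concrete, the following Python sketch fills the special strings in an override template using str.format. It illustrates the substitution only and is not Dragon's own code: the function name and node names are hypothetical, and the placeholder for the node count is assumed to be {nnodes}, as used in the examples.

```python
def fill_override(template, nodelist):
    """Substitute {nnodes} and {nodelist} into an mpiexec override template."""
    return template.format(nnodes=len(nodelist), nodelist=",".join(nodelist))


# Templates taken from the Cray-PALS examples for the two override options.
netconfig = "mpiexec --np {nnodes} -ppn 1 -l --line-buffer"
backend = "mpiexec --np {nnodes} --ppn 1 --cpu-bind none --hosts {nodelist} --line-buffer"

# Hypothetical node names, for illustration only.
nodes = ["node-1", "node-2"]
print(fill_override(netconfig, nodes))
# mpiexec --np 2 -ppn 1 -l --line-buffer
print(fill_override(backend, nodes))
# mpiexec --np 2 --ppn 1 --cpu-bind none --hosts node-1,node-2 --line-buffer
```

Because the braces are treated as placeholders, an override string should be passed inside single quotes on the shell command line so that the shell leaves them untouched.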

--tcp-runtime

Set this option if only TCP is used for backend communication, in order to turn off the warning message shown during runtime initialization

Default: False

test

Define paths necessary for executing tests of Dragon’s MPI application support

Examples

Set paths for headers and libraries for Cray MPICH, Open MPI, or ANL MPICH installations:
dragon-config test --cray-mpich=/opt/cray/pe/lmod/modulefiles/comnet/gnu/12.0/ofi/1.0/cray-mpich
dragon-config test --open-mpi=/lus/scratch/dragonhpc/openmpi
dragon-config test --anl-mpich=/lus/scratch/dragonhpc/mpich

dragon-config test [-h] [--cray-mpich CRAY_MPICH] [--open-mpi OPEN_MPI]
                   [--anl-mpich ANL_MPICH]

Named Arguments

--cray-mpich: Path to Cray MPICH installation
--open-mpi: Path to Open MPI installation
--anl-mpich: Path to ANL MPICH installation