Command Line Interface

Dragon provides several command-line interfaces (CLIs) that allow users to interact with the Dragon runtime and its components.

  • dragon - The dragon command is used to start the Dragon runtime services and user applications.

  • dragon-config - The dragon-config command is used to set configuration options for the Dragon runtime.

  • dragon-cleanup - The dragon-cleanup command is used to clean up Dragon runtime services and user applications in either a single or multi-node environment.

  • drun - The drun command uses an ssh-tree to run user applications on a set of hostnames.

  • dhosts - The dhosts command opens an interactive shell configured to run applications on a specified list of hostnames.

dragon

Dragon Launcher Arguments and Options

usage: dragon [-h] [--hostlist HOSTLIST | --hostfile HOSTFILE]
              [--network-prefix NETWORK_PREFIX]
              [--network-config NETWORK_CONFIG] [-w WORKLOAD_MANAGER]
              [-p PORT] [--overlay-port OVERLAY_PORT]
              [--frontend-port FRONTEND_PORT] [-t TRANSPORT_AGENT]
              [-o OVERLAY_TRANSPORT_AGENT] [-s | -m] [-l LOG_LEVEL] [-r]
              [-N NODE_COUNT] [-i IDLE_COUNT] [-e] [-T TELEM_LEVEL] [-b]
              [--no-label] [--basic-label] [--verbose-label] [--version]
              [PROG] ...

Positional Arguments

PROG

PROG specifies a program to be run on the primary compute node. The file itself does not need to be executable: if PROG is not executable, Python 3 is used to interpret it. All command-line arguments after PROG are passed to the program as its arguments. Both PROG and ARG are optional.
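The rule for non-executable files can be sketched as follows. This is a minimal illustration, not the launcher's actual code: the helper name build_command is hypothetical, and it assumes that "Python 3" can be represented by the interpreter running the sketch (sys.executable).

```python
import os
import sys


def build_command(prog, args):
    """Return the argv list used to start PROG (hypothetical helper).

    If PROG has execute permission it is run directly; otherwise it is
    handed to a Python 3 interpreter, mirroring the rule described above.
    """
    if os.access(prog, os.X_OK):
        return [prog, *args]
    # Assumption: fall back to the interpreter running this sketch.
    return [sys.executable, prog, *args]
```

With this rule, a plain script such as my_app.py (a hypothetical name) can be launched without first marking it executable.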

ARG

Zero or more program arguments may be specified.

Default: []

Named Arguments

--hostlist

Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required for WLM SSH, and both are only used for SSH.

--hostfile

Specify a list of hostnames to connect to via SSH launch. The file should be a newline-separated list of hostnames. Either --hostfile or --hostlist is required for WLM SSH, and both are only used for SSH.
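The two input formats (a comma-separated list and a newline-separated file) can be illustrated with a short sketch. The function names are hypothetical, and skipping empty entries and surrounding whitespace is an assumption of this example rather than documented launcher behavior.

```python
def parse_hostlist(value):
    """Split a comma-separated --hostlist value into hostnames."""
    return [host.strip() for host in value.split(",") if host.strip()]


def parse_hostfile_text(text):
    """Split the contents of a newline-separated hostfile into hostnames."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Both forms describe the same set of hosts; the hostfile is simply more convenient for long lists.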

--network-prefix

NETWORK_PREFIX specifies the network prefix the Dragon runtime will use to determine which IP addresses it should use to build multinode connections. By default the regular expression r'^(hsn|ipogif|ib)\d+$' is used, which matches the prefix for known HPE-Cray XC and EX high speed networks. If uncertain which networks are available, the following will return them in pretty formatting: dragon-network-ifaddrs --ip --no-loopback --up --running | jq. Prepending with srun may be necessary to get the networks available on backend compute nodes.
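To see which interface names a prefix pattern would select, the default expression can be tried against a list of names. This sketch assumes the default pattern uses the \d digit class and that the pattern is matched against the whole interface name; the interface names below are made up for illustration.

```python
import re

# Default pattern as described above, written with the \d digit class.
DEFAULT_NETWORK_PREFIX = r"^(hsn|ipogif|ib)\d+$"


def matching_interfaces(names, pattern=DEFAULT_NETWORK_PREFIX):
    """Return the interface names that match the network prefix pattern."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


# Hypothetical interface names, for illustration only.
print(matching_interfaces(["lo", "eth0", "hsn0", "hsn1", "ib0", "ipogif0"]))
# prints ['hsn0', 'hsn1', 'ib0', 'ipogif0']
```

On a system whose fast network uses a different naming scheme, a custom pattern such as r"^eth\d+$" could be passed instead.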

--network-config

NETWORK_CONFIG specifies a YAML or JSON file describing the available backend compute nodes. The file can be generated with the launcher's network config tool (e.g. dragon-network-config --output-to-yaml), which reports the nodes provided by the workload manager. Alternatively, it can be written manually, as is needed for an SSH-only launch. An example with keywords and formatting can be found in the documentation.

-w, --wlm

Possible choices: slurm, pbs+pals, ssh, k8s, drun

Specify what workload manager is used. Currently supported WLMs are: slurm, pbs+pals, ssh, k8s, drun

-p, --port

PORT specifies the port to be used for multinode communication. By default, 7575 is used.

--overlay-port

OVERLAY_PORT specifies the port to be used for the dragon overlay network communication. By default, 6565 is used.

--frontend-port

FRONTEND_PORT specifies the port to be used by the Overlay transport agent running on the Dragon frontend node. By default, 6566 is used.

-t, --transport

Possible choices: hsta, tcp, configured

TRANSPORT_AGENT selects which transport agent will be used for backend node-to-node communication. In the absence of a dragon-hsta binary, the TCP agent will be used. Currently supported agents are: hsta, tcp

Default: configured

-o, --overlay-transport

Possible choices: hsta, tcp, configured

OVERLAY_TRANSPORT_AGENT selects which transport agent will be used for node-to-node communication on the overlay network, connecting the frontend to the backend nodes. In the absence of a dragon-hsta binary, the TCP agent will be used. Currently supported agents are: hsta, tcp

Default: configured

-s, --single-node-override

Override automatic launcher selection to force use of the single node launcher

Default: False

-m, --multi-node-override

Override automatic launcher selection to force use of the multi-node launcher

Default: False

-l, --log-level

Possible choices: NONE, DEBUG, INFO, WARNING, ERROR, CRITICAL, stderr=NONE, stderr=DEBUG, stderr=INFO, stderr=WARNING, stderr=ERROR, stderr=CRITICAL, dragon_file=NONE, dragon_file=DEBUG, dragon_file=INFO, dragon_file=WARNING, dragon_file=ERROR, dragon_file=CRITICAL, actor_file=NONE, actor_file=DEBUG, actor_file=INFO, actor_file=WARNING, actor_file=ERROR, actor_file=CRITICAL

The Dragon runtime can write diagnostic log messages to several output devices. Diagnostic log messages can be seen on the Dragon stderr console, via a combined ‘dragon_*.log’ file, or via individual log files created by each of the Dragon ‘actors’ (Global Services, Local Services, etc).

By default, the Dragon runtime disables all diagnostic log messaging.

When one of NONE, DEBUG, INFO, WARNING, ERROR, or CRITICAL is passed to this option, the Dragon runtime enables the specified log verbosity. When DEBUG level logging is enabled, the Dragon runtime limits stderr and the combined dragon log file to INFO level messages. Each actor’s log file will contain the complete log history, including DEBUG messages. This is done to help limit the number of messages sent between the Dragon frontend and the Dragon backend at scale.

To override the default logging behavior and enable specific logging to one or more Dragon output devices, the LOG_LEVEL option can be formatted as a keyword=value pair, where the keyword is one of the Dragon log output devices (stderr, dragon_file or actor_file), and the value is one of NONE, DEBUG, INFO, WARNING, ERROR or CRITICAL (e.g. -l dragon_file=INFO -l actor_file=DEBUG). Multiple -l|--log-level options may be passed to enable the logging desired.

Default: {'DRAGON_LOG_DEVICE_STDERR': 'NONE', 'DRAGON_LOG_DEVICE_DRAGON_FILE': 'NONE', 'DRAGON_LOG_DEVICE_ACTOR_FILE': 'NONE'}
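The way bare levels and keyword=value pairs combine can be sketched as below. This is an illustration of the rules described above, not the launcher's parser: the function name is hypothetical, and it assumes that a bare level applies to all three devices and that later options override earlier ones.

```python
LEVELS = ("NONE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")
DEVICES = ("stderr", "dragon_file", "actor_file")


def resolve_log_levels(options):
    """Map a list of -l values to a per-device log level (illustrative)."""
    result = {device: "NONE" for device in DEVICES}  # logging is off by default
    for option in options:
        if "=" in option:
            # keyword=value form: set one device explicitly.
            device, level = option.split("=", 1)
            if device not in DEVICES or level not in LEVELS:
                raise ValueError(f"invalid log level option: {option}")
            result[device] = level
        else:
            if option not in LEVELS:
                raise ValueError(f"invalid log level: {option}")
            for device in DEVICES:
                result[device] = option
            if option == "DEBUG":
                # DEBUG is limited to INFO on stderr and the combined file;
                # only the per-actor files keep the full DEBUG history.
                result["stderr"] = "INFO"
                result["dragon_file"] = "INFO"
    return result
```

For example, resolve_log_levels(["dragon_file=INFO", "actor_file=DEBUG"]) leaves stderr at NONE while enabling the two file devices.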

-r, --resilient

If used, the Dragon runtime will attempt to continue execution of the user app in the event of a hardware or user software error by falling back to functional hardware resources and omitting hardware where the given error occurred.

Default: False

-N, --nodes

NODE_COUNT specifies the number of nodes to use. NODE_COUNT must be less than or equal to the number of available nodes within the WLM allocation. A value of zero (0) indicates that all available nodes should be used (the default).

-i, --idle

In conjunction with the --resilient flag, this specifies the number of nodes that will be held in reserve when the user application is run. In the event a node executing the user application experiences an error, the Dragon runtime will pull an “idle” node into the compute pool and begin executing the user application on it.

-e, --exhaust-resources

When used with --resilient execution, the Dragon runtime will continue executing the user application in the event of any number of localized hardware errors until there are 0 nodes available for computation. If not used, the default behavior is to continue executing until the number of available nodes is less than the number requested via the --nodes argument.

Default: False
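The stopping condition for resilient execution, with and without --exhaust-resources, can be written as a small predicate. This is a sketch of the rule as described above, with a hypothetical function name; how the runtime actually counts available nodes (for example, whether idle nodes are included) is an assumption here.

```python
def should_keep_running(available_nodes, requested_nodes, exhaust_resources):
    """Decide whether resilient execution continues after a node failure."""
    if exhaust_resources:
        # Continue until no nodes at all remain for computation.
        return available_nodes > 0
    # Default: stop once fewer nodes remain than were requested via --nodes.
    return available_nodes >= requested_nodes
```

With four nodes requested and one lost, the default behavior stops (3 < 4) unless an idle node can be pulled in, whereas --exhaust-resources keeps going on the remaining three.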

-T, --telemetry-level

The Dragon runtime enables native and user-defined telemetry. By default, the Dragon runtime disables all telemetry. When one of 1, 2, 3, 4, or 5 is passed to this option, the Dragon runtime enables the specified telemetry verbosity.

Default: 0

-b, --progress-bar

Enables a progress bar for HSTA request completions vs. the total number of expected request completions for the current launch configuration, which is defined using the values in sys.argv and the number of nodes used for the launch. The first run with this configuration simply collects the necessary information to use a progress bar. Subsequent runs will display the application’s progress via the progress bar. Data collected during the first run will be stored in a file contained in a hidden .dragon directory in the current working directory from which the application was launched. This feature currently requires the use of a parallel file system such as Lustre or NFS.

Default: False

--no-label

Default: True

--basic-label

Default: False

--verbose-label

Default: False

--version

show program’s version number and exit

dragon-config

Configure the build and runtime environments for Dragon with regard to third-party libraries. This is needed for building network backends for HSTA, as well as for GPU support more generally. In future releases, this script may also be used for runtime configuration of libraries. Additionally, some options provide information about the Dragon installation to allow Dragon header files and libraries to be used in compiled applications.

usage: dragon-config [-h] [-c] [--config-file CONFIG_FILE] [-s] [-g GET]
                     [-l | -o | -e]
                     {add,test} ...

Named Arguments

-c, --clean

Clean out all config information.

Default: False

--config-file

Point configuration to a custom config file. Largely intended for testing

-s, --serialize

Serialize all key-value pairs currently in the configuration file into a single, colon-separated string that can be passed to the --add command.

Default: False

-g, --get

Get value for given key that can be passed to the --add or --add-mpiexec command.

-l, --linker-options

For execution during linking, print the linker options for building applications against the Dragon C/C++ API

Default: False

-o, --compiler-options

For execution during compilation, print the compiler option for building applications built against Dragon C/C++ API

Default: False

-e, --explicit-compiler-options

With brief description, print the compilation and link options for building C programs with Dragon and exit

Default: False
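One way the -o and -l options might be used from a build script is sketched below. The helper names are hypothetical, and the sketch assumes that dragon-config prints a shell-style option string on standard output; the compiler name and file names are placeholders.

```python
import shlex
import subprocess


def dragon_config_output(flag):
    """Run `dragon-config <flag>` and split its output into arguments.

    Assumption: the tool prints a shell-style option string on stdout.
    """
    result = subprocess.run(
        ["dragon-config", flag], capture_output=True, text=True, check=True
    )
    return shlex.split(result.stdout)


def build_compile_command(source, output, compiler_opts, linker_opts, cc="gcc"):
    """Assemble a compile-and-link command line from the two option lists."""
    return [cc, *compiler_opts, source, "-o", output, *linker_opts]
```

A build script could then call build_compile_command("app.c", "app", dragon_config_output("-o"), dragon_config_output("-l")) and pass the result to subprocess.run. Because -l, -o and -e are mutually exclusive, the two option lists are fetched with separate invocations.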

Add and test paths subparser

add

Possible choices: add, test

Add paths for configuration, compilation, execution, and testing of Dragon

Sub-commands

add

Define a number of paths (key=value) to configure include and library paths for Dragon, or to make the TCP runtime the always-on default for backend communication (set to True).

Examples

UCX backend:

dragon-config add --ucx-include=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/include

dragon-config add --ucx-build-lib=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/lib

dragon-config add --ucx-runtime-lib=/opt/nvidia/hpc_sdk/Linux_x86_64/23.11/comm_libs/12.3/hpcx/hpcx-2.16/ucx/prof/lib

Set TCP transport as always-on default backend:

dragon-config add --tcp-runtime

Set PMIx header files location to enable PMIx support for MPI applications. Specifically looking for path <pmix include>/src/include/pmix_globals.h:

dragon-config add --pmix-include=/usr/include:/usr/include/pmix

dragon-config add [-h] [--ofi-include OFI_INCLUDE] [--ucx-include UCX_INCLUDE]
                  [--pmix-include PMIX_INCLUDE] [--mpi-include MPI_INCLUDE]
                  [--cuda-include CUDA_INCLUDE] [--hip-include HIP_INCLUDE]
                  [--ze-include ZE_INCLUDE] [--ofi-build-lib OFI_BUILD_LIB]
                  [--ucx-build-lib UCX_BUILD_LIB]
                  [--ofi-runtime-lib OFI_RUNTIME_LIB]
                  [--ucx-runtime-lib UCX_RUNTIME_LIB]
                  [--cuda-runtime-lib CUDA_RUNTIME_LIB]
                  [--netconfig-mpiexec-override NETCONFIG_MPIEXEC_OVERRIDE]
                  [--backend-mpiexec-override BACKEND_MPIEXEC_OVERRIDE]
                  [--tcp-runtime]
Named Arguments
--ofi-include

Include path for OFI headers to be used when building dragon

--ucx-include

Include path for UCX headers to be used when building dragon

--pmix-include

Include path for PMIx headers to be used when building dragon

--mpi-include

Include path for MPI headers to be used when building dragon

--cuda-include

Include path for CUDA headers to be used when building dragon

--hip-include

Include path for HIP headers to be used when building dragon

--ze-include

Include path for Ze headers to be used when building dragon

--ofi-build-lib

Path to OFI libraries (eg: libfabric.so) to be used when building dragon

--ucx-build-lib

Path to UCX libraries (eg: libucp.so) to be used when building dragon

--ofi-runtime-lib

Path to OFI libraries (eg: libfabric.so) to be used during app execution

--ucx-runtime-lib

Path to UCX libraries (eg: libucp.so) to be used during app execution

--cuda-runtime-lib

Path to CUDA libraries (eg: libcudart.so) to be used during app execution

--netconfig-mpiexec-override

Add mpiexec override commands for Dragon’s PBS+PALS launcher. This is used to add overrides for the mpiexec commands used to launch the network config tool and the deprecated cleanup processes. The command needs to launch one process per node, line buffer the output, and tag the output with the process rank with some unique identifying information (global rank, hostname, etc). The commands should be passed as a single string. The following special strings are necessary and will be automatically filled in at the time of use by Dragon:

{nnodes} = number of nodes

Examples

Set launcher mpiexec network config override for Cray-PALS: $ dragon-config add --netconfig-mpiexec-override='mpiexec --np {nnodes} -ppn 1 -l --line-buffer'

Set launcher mpiexec network config override for OpenMPI 5.0.6: $ dragon-config add --netconfig-mpiexec-override='mpiexec --np {nnodes} --map-by ppr:1:node --stream-buffering=1 --tag-output'

These commands are used by default when the dragon launcher detects PBS+PALS.

To avoid checks with the automatic wlm detection and utilize the overridden mpiexec commands, run dragon with the workload manager specified as '--wlm=pbs+pals'.

--backend-mpiexec-override

Add mpiexec override commands for Dragon’s PBS+PALS launcher. This is used to add overrides for the mpiexec commands used to launch the backend processes. The command should be passed as a single string. The following special strings are necessary and will be automatically filled in at the time of use by Dragon:

{nnodes} = number of nodes, {nodelist} = comma separated list of nodes

Examples

Set launcher mpiexec backend launch override for Cray-PALS: $ dragon-config add --backend-mpiexec-override='mpiexec --np {nnodes} --ppn 1 --cpu-bind none --hosts {nodelist} --line-buffer'

Set launcher mpiexec backend launch override for OpenMPI 5.0.6: $ dragon-config add --backend-mpiexec-override='mpiexec --np {nnodes} --map-by ppr:1:node --stream-buffering=1 --tag-output --host {nodelist}'

These commands are used by default when the dragon launcher detects PBS+PALS.

To avoid checks with the automatic wlm detection and utilize the overridden mpiexec commands, run dragon with the workload manager specified as '--wlm=pbs+pals'.
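How the special strings in an override might be filled in can be illustrated with Python's str.format. This is only a sketch: the helper name is hypothetical, the node names are made up, and the use of str.format-style substitution is an assumption about how Dragon performs the replacement.

```python
def fill_override(template, nodes):
    """Substitute the {nnodes} and {nodelist} placeholders in an override."""
    return template.format(nnodes=len(nodes), nodelist=",".join(nodes))


# Backend override for Cray-PALS, as in the example above.
template = (
    "mpiexec --np {nnodes} --ppn 1 --cpu-bind none "
    "--hosts {nodelist} --line-buffer"
)
print(fill_override(template, ["node1", "node2"]))
# prints: mpiexec --np 2 --ppn 1 --cpu-bind none --hosts node1,node2 --line-buffer
```

Expanding a template by hand like this is a quick way to check that an override string produces the intended mpiexec command before storing it with dragon-config.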

--tcp-runtime

If only using TCP for backend communication, set in order to turn off warning message during initialization of runtime

Default: False

test

Define paths necessary for executing tests of Dragon’s MPI application support

Examples

Set paths for headers and libraries for Cray MPICH, Open MPI, or ANL MPICH installations:

dragon-config test --cray-mpich=/opt/cray/pe/lmod/modulefiles/comnet/gnu/12.0/ofi/1.0/cray-mpich

dragon-config test --open-mpi=/lus/scratch/dragonhpc/openmpi

dragon-config test --anl-mpich=/lus/scratch/dragonhpc/mpich

dragon-config test [-h] [--cray-mpich CRAY_MPICH] [--open-mpi OPEN_MPI]
                   [--anl-mpich ANL_MPICH]
Named Arguments
--cray-mpich

Path to Cray MPICH installation

--open-mpi

Path to Open MPI installation

--anl-mpich

Path to ANL MPICH installation

dragon-cleanup

The dragon-cleanup tool identifies and/or removes residual dragon runtime services and user applications from previous runs in either a single or multi-node environment. This is particularly useful when an execution of Dragon fails to exit cleanly, leaving behind orphaned processes or resources that could interfere with subsequent runs.

The tool automatically detects whether a Workload Manager (WLM), such as Slurm or PBS, was used for node allocation. If a WLM is present, dragon-cleanup targets those active nodes. Nodes can also be specified manually via the --hostlist or --hostfile arguments, or by using the dhosts utility to set the DRAGON_RUN_NODEFILE environment variable.

To ensure efficiency across large multi-node environments, dragon-cleanup utilizes an SSH-tree to launch cleanup processes on each node. This requires that all nodes are configured for password-less SSH and maintain mutual routability.

Example usage:

dragon-cleanup --hostlist host1,host2,host3

Manually specify that dragon-cleanup should run on host1, host2 and host3.

dragon-cleanup --hostfile my_hostfile.txt --dry-run

Specify that dragon-cleanup should run in dry-run mode on the hosts specified in my_hostfile.txt. In dry-run mode, dragon-cleanup will print the processes and resources it would clean up, but won’t actually make any changes.

dragon-cleanup --wlm slurm

Force dragon-cleanup to look for an active Slurm allocation and run on the nodes from that allocation.

usage: dragon-cleanup [-h] [--wlm WORKLOAD_MANAGER]
                      [--hostlist HOST_LIST | --hostfile HOST_LIST] [-s | -m]
                      [--dry-run] [--resilient] [--timeout TIMEOUT]
                      [--only-be]

Named Arguments

--wlm

Possible choices: slurm, pbs, ssh

Specify what workload manager is used. Currently supported WLMs are: slurm, pbs, ssh

--hostlist

Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required for WLM SSH, and both are only used for SSH.

Default: []

--hostfile

Specify a list of hostnames to connect to via SSH launch. The file should be a newline-separated list of hostnames. Either --hostfile or --hostlist is required for WLM SSH, and both are only used for SSH.

Default: []

-s, --single-node-override

Override automatic launcher selection to force use of the single node launcher

Default: False

-m, --multi-node-override

Override automatic launcher selection to force use of the multi-node launcher

Default: False

--dry-run

Dry run. Don’t actually make changes

Default: False

--resilient

Prevent removing resources to enable resilient restart of the Dragon runtime

Default: False

--timeout

Time to wait when terminating a process before killing it.

Default: 2
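The terminate-then-kill pattern that a timeout like this governs can be sketched with the standard library. This is a generic illustration rather than dragon-cleanup's implementation: the function name is hypothetical, and treating the timeout as a number of seconds is an assumption, since the unit is not stated above.

```python
import subprocess


def stop_process(proc, timeout=2):
    """Ask a process to terminate, then kill it if it has not exited in time.

    `proc` is a subprocess.Popen object. Assumption: `timeout` is in seconds.
    """
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # forceful stop (SIGKILL on POSIX)
        proc.wait()
    return proc.returncode
```

A longer timeout gives processes more time to release resources cleanly, at the cost of a slower cleanup when many processes are unresponsive.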

--only-be

Only tear down Dragon processes on backend compute nodes.

Default: False

drun

The DragonRun (drun) utility is used to launch applications on a set of hosts.

The tool automatically detects whether a Workload Manager (WLM), such as Slurm or PBS, was used for node allocation. If a WLM is present, drun targets those active nodes. Nodes can also be specified manually via the --hostlist or --hostfile arguments, or by using the dhosts utility to set the DRAGON_RUN_NODEFILE environment variable.

To ensure efficiency across large multi-node environments, drun utilizes an SSH-tree to launch processes on each node. This requires that all nodes are configured for password-less SSH and maintain mutual routability.

Example usage:

drun --hostlist host1,host2,host3 my_executable --option1 --option2

Manually specify a list of hosts, in this case, host1, host2 and host3, on which to run my_executable with options --option1 and --option2.

drun --hostfile my_hostfile.txt my_executable

Specify a file containing a list of hosts, in this case, my_hostfile.txt, on which to run my_executable.

drun --wlm slurm my_executable

Force drun to look for an active Slurm allocation and use the nodes from that allocation to run my_executable.

usage: drun [-h] [--wlm WORKLOAD_MANAGER]
            [--hostlist HOST_LIST | --hostfile HOST_LIST] [-s | -m]
            [--export {ALL,NONE}] [--env KEY=VALUE] [--include-fe]
            [--fanout FANOUT] [-l LOG_LEVEL]
            ...

Positional Arguments

USER_CMD

The executable, including any command line options, to execute on the remote nodes.

Default: []

Named Arguments

--wlm

Possible choices: slurm, pbs, ssh

Specify what workload manager is used. Currently supported WLMs are: slurm, pbs, ssh

--hostlist

Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required for WLM SSH, and both are only used for SSH.

Default: []

--hostfile

Specify a list of hostnames to connect to via SSH launch. The file should be a newline-separated list of hostnames. Either --hostfile or --hostlist is required for WLM SSH, and both are only used for SSH.

Default: []

-s, --single-node-override

Override automatic launcher selection to force use of the single node launcher

Default: False

-m, --multi-node-override

Override automatic launcher selection to force use of the multi-node launcher

Default: False

--export

Possible choices: ALL, NONE

Identify which environment variables from the submission environment are propagated to the launched application.

Default: 'NONE'

--env

Environment variables to set in the remote environment. Example: --env DEBUG=True

Default: {}
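The interaction between --export and --env can be pictured as building a dictionary for the remote environment. The sketch below is illustrative only: the function name is hypothetical, and it assumes that --env may be given more than once and that --env values take precedence over exported variables.

```python
import os


def build_remote_env(export, env_options, local_env=None):
    """Combine --export and --env KEY=VALUE options into one environment."""
    local_env = dict(os.environ) if local_env is None else dict(local_env)
    # ALL propagates the submission environment; NONE starts empty.
    remote = local_env if export == "ALL" else {}
    for item in env_options:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got: {item}")
        remote[key] = value  # assumption: --env overrides exported values
    return remote
```

With the default --export NONE, only the variables named with --env would reach the launched application under these assumptions.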

--include-fe

In addition to running the given command on the dragon backend node, also run the command on the dragon frontend.

Default: False

--fanout

DragonRun uses a fanout tree to efficiently communicate with its backend nodes. This value sets the number of children each node in this fanout tree talks to.

Default: 16
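The effect of the fanout value on tree shape can be explored with a small sketch. It assumes a complete tree laid out in array order (node 0 is the root, and node i has children i * fanout + 1 onward), which is a common convention but not necessarily the layout DragonRun uses; the function names are hypothetical.

```python
def children_of(index, fanout, total):
    """Return the child indices of node `index` in a tree of `total` nodes."""
    first = index * fanout + 1
    return list(range(first, min(first + fanout, total)))


def tree_depth(total, fanout):
    """Return the number of levels below the root needed for `total` nodes."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    depth, covered, level_size = 0, 1, 1
    while covered < total:
        level_size *= fanout  # nodes on the next level
        covered += level_size
        depth += 1
    return depth
```

Under this layout, the default fanout of 16 reaches 273 nodes (1 + 16 + 256) within two levels, so tree_depth(273, 16) is 2. A larger fanout gives a shallower tree at the cost of more connections per node.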

-l, --log-level

Possible choices: CRITICAL, FATAL, ERROR, WARN, WARNING, INFO, DEBUG, NOTSET

Enables the output of diagnostic log messages. By default, the DragonRun runtime disables all diagnostic log messaging. When one of NOTSET, DEBUG, INFO, WARNING, ERROR, or CRITICAL is passed to this option, the runtime enables the specified log verbosity.

Default: 'NOTSET'

dhosts

The dhosts utility defines the list of hosts that should be used by other Dragon runtime tools. To do this, dhosts generates a temporary hostfile and exports the DRAGON_RUN_NODEFILE environment variable within a subshell. To generate the host list, dhosts first attempts to detect an active Workload Manager (WLM) allocation, such as from Slurm or PBS. If no WLM is present, or if dhosts is unable to detect the allocated nodes from the WLM, the list of hosts can be specified manually via the --hostlist or --hostfile options.

This is useful for running other dragon tools on a specific set of hosts without having to specify the list of hosts to each tool individually. Since dhosts exports the DRAGON_RUN_NODEFILE environment variable, any tool that relies on this environment variable can automatically use the generated hostlist. For example, dragon-cleanup will automatically use the hostlist generated by dhosts if DRAGON_RUN_NODEFILE is set in the environment.
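A tool that wants to honor the host list set up by dhosts could read it as sketched below. The function name is hypothetical, and the sketch assumes that the file named by DRAGON_RUN_NODEFILE contains one hostname per line, like a --hostfile; that format is not stated above.

```python
import os


def hosts_from_environment(environ=None):
    """Return the hosts listed in DRAGON_RUN_NODEFILE, or None if unset."""
    environ = os.environ if environ is None else environ
    nodefile = environ.get("DRAGON_RUN_NODEFILE")
    if not nodefile:
        return None
    with open(nodefile) as handle:
        # Assumption: one hostname per line, blank lines ignored.
        return [line.strip() for line in handle if line.strip()]
```

Returning None when the variable is unset lets the caller fall back to WLM detection or explicit --hostlist and --hostfile options.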

To ensure efficiency across large multi-node environments, the DragonRun (drun) launcher utilizes an SSH-tree to launch processes on each node. This requires that all nodes are configured for password-less SSH and maintain mutual routability.

Example usage:

dhosts --hostlist host1,host2,host3

Manually specify a list of hosts, in this case, host1, host2 and host3.

dhosts --hostfile my_hostfile.txt

Specify a file containing a list of hosts, in this case, my_hostfile.txt.

dhosts --wlm slurm

Force dhosts to look for an active Slurm allocation and use the nodes from that allocation.

dhosts --list

Print the list of hosts that dhosts has determined should be used in the current environment.

usage: dhosts [-h] [--wlm WORKLOAD_MANAGER]
              [--hostlist HOST_LIST | --hostfile HOST_LIST] [--list]

Named Arguments

--wlm

Possible choices: slurm, pbs, ssh

Specify what workload manager is used. Currently supported WLMs are: slurm, pbs, ssh

--hostlist

Specify backend hostnames as a comma-separated list, e.g. --hostlist host_1,host_2,host_3. Either --hostfile or --hostlist is required for WLM SSH, and both are only used for SSH.

Default: []

--hostfile

Specify a list of hostnames to connect to via SSH launch. The file should be a newline-separated list of hostnames. Either --hostfile or --hostlist is required for WLM SSH, and both are only used for SSH.

Default: []

--list

List known hosts in the current dragon run environment

Default: False