.. _MultiNodeDeployment:

Multi Node Deployment
+++++++++++++++++++++

This section outlines the bring-up and teardown procedures for multi-node systems. A multi-node
launch starts from a frontend node on which the user is assumed to have a command prompt. This
frontend can be co-located with backend nodes or run from its own node.

All off-node communication that is initiated by the :ref:`LocalServices` goes through
:ref:`GlobalServices`. Local Services itself has a one-node view of the world while Global Services
does the work of communicating off node when necessary. :numref:`deploy-multi-node` and
:numref:`multi-node-overview` depict a multi-node version of the Dragon :ref:`Services`.

.. figure:: images/deployment_multi_node.svg
    :name: deploy-multi-node

    **Startup Overview**

.. figure:: images/multinodeoverview.png
    :scale: 30%
    :name: multi-node-overview

    **Multi-Node Overview of Dragon Services**

.. _MultiNodeBringup:

Multi Node Bringup
==================

.. _NetworkConfiguration:

Network Configuration
---------------------

The launcher frontend must know what resources are available for its use on the compute backend. To
obtain that information for a given set of workload managers, there is a network config tool in the
launcher module. This tool is exposed for general use. However, if deploying Dragon on a supported
workload manager with an active job allocation, the launcher frontend will handle obtaining the
network configuration.

The tool exists in `dragon.launcher.network_config.main`, but can be invoked directly via
`dragon-network-config`. Its help is below:

.. autodocstringonly:: dragon.launcher.network_config.main

If output to a file (YAML and JSON are supported), the file can be provided to the launcher
frontend at launch. Formatting of the files appears below:

.. code-block:: YAML
    :linenos:
    :caption: **Example of YAML formatted network configuration file**

    '0':
      h_uid: null
      host_id: 18446744071562724608
      ip_addrs:
      - 10.128.0.5:6565
      is_primary: true
      name: nid00004
      num_cpus: 0
      physical_mem: 0
      shep_cd: ''
      state: 4
    '1':
      h_uid: null
      host_id: 18446744071562724864
      ip_addrs:
      - 10.128.0.6:6565
      is_primary: false
      name: nid00005
      num_cpus: 0
      physical_mem: 0
      shep_cd: ''
      state: 4

.. code-block:: JSON
    :linenos:
    :caption: **Example of JSON formatted network configuration file**

    {
      "0": {
        "state": 4,
        "h_uid": null,
        "name": "nid00004",
        "is_primary": true,
        "ip_addrs": [
          "10.128.0.5:6565"
        ],
        "host_id": 18446744071562724608,
        "num_cpus": 0,
        "physical_mem": 0,
        "shep_cd": ""
      },
      "1": {
        "state": 4,
        "h_uid": null,
        "name": "nid00005",
        "is_primary": false,
        "ip_addrs": [
          "10.128.0.6:6565"
        ],
        "host_id": 18446744071562724864,
        "num_cpus": 0,
        "physical_mem": 0,
        "shep_cd": ""
      }
    }

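
Either format can be inspected with standard parsers. The short sketch below is illustrative only
and is not part of the Dragon API; it loads a saved configuration file in either format and prints
each node's name, addresses, and role, assuming the file layout shown above:

.. code-block:: python
    :caption: **Hypothetical script that inspects a saved network configuration file**

    import json
    import sys

    import yaml  # PyYAML; only needed for the YAML-formatted variant


    def load_network_config(path):
        """Load a saved network configuration file (YAML or JSON)."""
        with open(path) as fh:
            if path.endswith((".yaml", ".yml")):
                return yaml.safe_load(fh)
            return json.load(fh)


    if __name__ == "__main__":
        # e.g. python3 inspect_config.py my_allocation.yaml
        config = load_network_config(sys.argv[1])
        for index, node in sorted(config.items(), key=lambda kv: int(kv[0])):
            role = "primary" if node["is_primary"] else "non-primary"
            print(f"node {index}: {node['name']} {node['ip_addrs']} ({role})")
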

Launching
---------

The :ref:`LocalServices` and :ref:`GlobalServices` are instantiated when a program is launched by
the Dragon :ref:`Launcher`. There is one Local Services instance on each node and one Global
Services instance in the entire job. Since all :ref:`Services` run as user-level services (i.e. not
with superuser authority), the services described here are assumed to be one per launched user
program.

The multi-node bring-up sequence is given in :numref:`startup-seq-multinode` and in the section
titled :ref:`MultiNodeBringup`, where the message descriptions are also provided. The Launcher
Frontend brings up an instance of the Launcher Backend on each node. Each launcher (frontend and
backend) then brings up an instance of the TCP Transport Agent, which serves to create an Overlay
Tree for communicating infrastructure-related messages. The Launcher Backend then brings up Local
Services. The Backend forwards messages from the Launcher Frontend to Local Services. Local
Services forwards output from the user program to the Frontend through the Backend.

Sequence diagram
----------------

The diagram below depicts the message flow in the multi-node startup sequence.

.. raw:: html
    :file: images/startup_seq_multi_node.svg

**Sequence diagram of Dragon multi-node bringup**

Notes on Bring-up Sequence
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _launch_net_config:

Get Network Configuration (A1-A3)
    Launch one instance of the network config tool on each backend compute node. Each instance will
    use other Dragon tools to determine a preferred IP address for the frontend and its unique Host
    ID, which will be used in routing messages to and from it. Each network config instance will
    line-buffer the network information to stdout as a serialized JSON structure. The frontend will
    consume this information. The config tool will then exit.

.. _wlm-launch:

Workload Manager Launch (Bring-up M1, M5 | Teardown M24)
    Launch one instance of a given script/executable on each backend. No CPU affinity binding
    should be used (e.g., `srun -n <# nodes> -N <# nodes> --cpu_bind=none python3 backend.py`).
    The exit code from the WLM should be monitored for errors.

.. _launch-frontend-overlay:

Launch frontend overlay agent (A3-A4)
    Launch the TCP transport agent. Via command-line arguments, provide channels for communication
    to and from the frontend (appearing collectively as `CTRL ch` in the sequence diagrams), IP
    addresses of all backend compute nodes, and host IDs of all backend compute nodes.

    **NOTE: The Frontend and Backend each manage their own memory pool independent of Local
    Services, specifically for managing the Channels necessary for Overlay Network communication.**

    In the environment, there will be environment variables set to tell the overlay network what
    gateway channels it should monitor for remote messaging. Once the agent is up, it sends a ping
    saying as much.

.. _overlay-init:

Overlay input arguments (M3, M6):

.. automodule:: dragon.transport.overlay
    :members: start_overlay_network

.. _launch-backend:

Launch backend and its overlay network (A4-M8)
    Use the workload manager to launch one instance of the backend on each compute node. In command
    line arguments, provide a channel to communicate with the frontend, the frontend's IP address,
    and its host ID. The Backend will provide the frontend channel, IP, and host ID to start its
    own TCP overlay agent as in :ref:`launch-frontend-overlay`. Using the information provided via
    the launch command and its own locally obtained IP address and host ID, each backend instance
    will be able to communicate directly with the single frontend instance and vice versa. Once the
    backend and its overlay agent are up, the backend sends a `BEIsUp` message to the frontend. The
    frontend waits until it receives this message from every backend before continuing.

.. _give-node-id:

Provide Node Index to each Backend instance (M9)
    The Frontend will use host ID information in the `BEIsUp` message to assign node indices. The
    primary node is given node index 0 and can be specified by hostname by the user when starting
    the Dragon runtime. The remaining values will be integers up to `N-1`, where `N` is the number
    of backend compute nodes. The node index is given in `FENodeIdxBE`. It travels from the
    frontend over the channel whose descriptor was provided in the `BEIsUp` by the backend.

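
The index assignment carried in `FENodeIdxBE` is straightforward to illustrate. The sketch below is
not the launcher's actual implementation; it assumes the frontend has collected one
`(host_id, hostname)` pair per `BEIsUp` message and that the primary node was chosen by hostname:

.. code-block:: python
    :caption: **Illustrative sketch of node index assignment**

    def assign_node_indices(backends, primary_hostname):
        """Give the primary node index 0 and the remaining nodes 1..N-1.

        ``backends`` is assumed to be a list of (host_id, hostname) pairs
        gathered from the BEIsUp messages; the real launcher tracks more state.
        """
        primary = [b for b in backends if b[1] == primary_hostname]
        others = sorted(b for b in backends if b[1] != primary_hostname)
        # host_id -> node index, later returned to each backend in FENodeIdxBE
        return {host_id: idx for idx, (host_id, _name) in enumerate(primary + others)}


    indices = assign_node_indices(
        [(18446744071562724608, "nid00004"), (18446744071562724864, "nid00005")],
        primary_hostname="nid00004",
    )
    print(indices)  # {18446744071562724608: 0, 18446744071562724864: 1}
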

.. _start-local-services:

Start Local Services (A12-A15)
    This is the most critical part of bring-up. If it is successful, everything else will likely be
    fine. The Launcher Backend popens Local Services. Over `stdin`, it provides `BENodeIdxSH`. That
    message contains the logging channel descriptor, hostname, primary node status (boolean), node
    index, and IP address for backend communication.

    **This is a potentially different IP address than the one being used for the overlay network.
    If it is the same IP, it IS a different port, as it will be used by a different transport agent
    for communication among other backend nodes, NOT the frontend.**

    `SHPingBE` is returned to the backend over `stdout`. This message contains the channel
    descriptors that will be used for all future communication on the node. The backend sends
    `BEPingSH` to confirm channel communication is successful. `SHChannelsUp` is returned by Local
    Services. It contains channel descriptors for all service communication for its node, including
    Global Services (for the primary node), which every other node will use to communicate with the
    Global Services instance on the primary node. The Host ID is also contained in `SHChannelsUp`.

.. _many-to-one:

Many-to-One SHChannelsUp, TAUp, TAHalted, SHHaltBE (Bring-up M14, M21 | Teardown M15, M19):
    These messages form a gather operation from all the backend nodes to the frontend. This is a
    potential hang situation, and care is required to eliminate the likelihood of a hung bring-up
    sequence.

.. _one-to-many:

One-to-Many LAChannelsInfo, SHHaltTA (Bring-up M15 | Teardown M10, M16):
    These messages and their communication (a one-to-many Bcast) represent one of the biggest
    bottlenecks in the bring-up sequence. For larger messages, the first step in reducing that cost
    is compression. The next would be to implement a tree Bcast.

.. _transmit-lachannelsinfo:

Transmit LAChannelsInfo (A15-A16)
    This is the most critical message. It contains all information necessary to make execution of
    the Dragon runtime possible. The Frontend receives one `SHChannelsUp` from each backend. It
    aggregates these into `LAChannelsInfo`. This is potentially a very large message, as it scales
    linearly with the number of nodes. It is beneficial to compress this message before
    transmitting.

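
The benefit of compressing before the broadcast can be seen with standard library tools alone. The
snippet below is purely illustrative and uses a JSON stand-in for the serialized message; the real
`LAChannelsInfo` encoding is defined by Dragon's infrastructure messages, not by this example:

.. code-block:: python
    :caption: **Illustrative compression of a large aggregated message**

    import base64
    import json
    import zlib

    # Stand-in for a serialized LAChannelsInfo: per-node channel descriptors
    # are long strings that compress well and grow linearly with node count.
    payload = json.dumps(
        {str(i): {"host_id": i, "shep_cd": "A" * 512} for i in range(512)}
    ).encode("utf-8")

    compressed = zlib.compress(payload, level=9)
    print(f"raw: {len(payload)} bytes, compressed: {len(compressed)} bytes")

    # If the transport expects text, wrap the compressed bytes in base64.
    wire = base64.b64encode(compressed).decode("ascii")
    assert zlib.decompress(base64.b64decode(wire)) == payload
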

.. _start-transport:

Start Transport Agent (A16-A18)
    Local Services starts the Transport Agent as a child process. Over the command line, it passes
    the node index of the node, a channel descriptor to receive messages from Local Services (TA
    ch), and a logging channel descriptor. In the environment are environment variables containing
    channel descriptors for gateway channels, to be used for sending and receiving messages
    off-node. Over TA ch, `LAChannelsInfo` is provided, so the transport agent can take the IP and
    host ID information needed to manage communication to other nodes on the backend. The transport
    agent communicates a `TAPingSH` to Local Services over the Local Services channel provided for
    its node index in the `LAChannelsInfo` message. This is sent upstream as a `TAUp` message. The
    frontend waits for such a message from every backend.

.. _transport-cli:

CLI Inputs to Start Transport Agent (M17)

.. automodule:: dragon.transport
    :members: start_transport_agent

.. _start-global-services:

Start Global Services (A18-A22)
    After sending `TAUp`, Local Services on the primary node starts Global Services. Over `stdin`,
    Local Services provides `LAChannelsInfo`. With this information, Global Services is immediately
    able to ping all Local Services instances on the backend via `GSPingSH` (note: the Transport
    Agent is necessary to route those correctly). After receiving a response from every Local
    Services (`SHPingGS`), it sends a `GSIsUp` that is ultimately routed to the Frontend. It is
    important to note the `GSIsUp` message is not guaranteed to arrive after all `TAUp` messages,
    i.e., `GSIsUp` will necessarily come after `TAUp` on the primary node but may come before the
    `TAUp` message from any other backend node. Once all `TAUp` and `GSIsUp` messages are received,
    the Dragon runtime is fully up.

.. _start-user-program:

Start User Program (A22-A25)
    The Frontend starts the user program. This program was provided with the launcher's invocation.
    It is sent to the primary node via a `GSProcessCreate` message. Upon receipt, Global Services
    sends a `GSProcessCreateResponse`. After sending the response, it selects a Local Services
    instance to deploy the program to. After selection, it sends this request as a
    `SHProcessCreate` to the selected Local Services. This Local Services sends a
    `SHProcessCreateResponse` as receipt.

.. _route-stdout:

Route User Program stdout (A25-A26)
    Local Services ingests all of the User Program's `stdout` and `stderr`. This is propagated to
    the Frontend as a `SHFwdOutput` message that contains information about what process the
    strings originated from. With this information, the Frontend can provide the output via its
    `stdout` so the user can see it buffered cleanly and correctly with varying levels of
    verbosity.

.. _MultiNodeTeardown:

Multi Node Teardown
===================

In the multi-node case, there are a few more messages that are processed by Local Services than in
the single-node case. The `GSHalted` message is forwarded by Local Services to the launcher
backend. The `SHHaltTA` is sent to Local Services, and it forwards the `TAHalted` message to the
backend when received. Other than these three additional messages, the teardown is identical to the
single-node version. One difference is that this teardown process is repeated on every one of the
compute nodes in the multi-node case.

In an abnormal situation, the `AbnormalTermination` message may be received by the Launcher from
either Local Services or Global Services (via the Backend). In that case, the launcher will
initiate a teardown of the infrastructure, starting with the sending of `GSTeardown` (message 5 in
the diagram below).

Sequence diagram
----------------

.. raw:: html
    :file: images/teardown_seq_multi_node.svg

**Multi-Node Teardown Sequence**

Notes on Teardown Sequence
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _head-proc-exit:

Head Proc Exit (A1-A4)
    Local Services monitors its managed processes via `waitpid`. Once it registers an exit, it
    matches the pid to the internally tracked and globally unique puid. This puid is transmitted
    via `SHProcessExit` to Global Services. Global Services cleans up any resources tied to
    tracking that specific head process. Once that is complete, it alerts the Frontend via a
    `GSHeadExit`.

.. _death-watcher:

Local Services Death Watcher (M1):
    Local Services' main thread has a thread repeatedly calling `waitpid`. This thread will receive
    the exit code from the head process.

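
The watcher itself is a standard `waitpid` loop. The sketch below is a simplified stand-in for what
Local Services does, with a hypothetical `report_exit` callback in place of sending the real
`SHProcessExit` message and an arbitrary example puid:

.. code-block:: python
    :caption: **Illustrative sketch of a death-watcher thread**

    import os
    import threading


    def death_watcher(pid_to_puid, report_exit):
        """Reap managed children and report (puid, exit code) for each one."""
        while pid_to_puid:
            pid, status = os.waitpid(-1, 0)  # block until any child exits
            if pid in pid_to_puid:
                puid = pid_to_puid.pop(pid)
                report_exit(puid, os.waitstatus_to_exitcode(status))


    # Example: watch a single forked child standing in for the head process.
    child = os.fork()
    if child == 0:
        os._exit(7)

    watcher = threading.Thread(
        target=death_watcher,
        args=({child: 1001}, lambda puid, rc: print(f"puid {puid} exited with {rc}")),
    )
    watcher.start()
    watcher.join()
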

.. _start-teardown:

Begin Teardown (A5-A7)
    With the exit of the head process, as of this writing (03/24/2023), the Launcher will begin
    teardown. Teardown is initiated via the `GSTeardown` message to Global Services. Once this
    message is sent, all user input is ignored; teardown is the only work being done. `GSTeardown`
    is also the message sent when the user issues a SIGINT signal (assuming the full runtime is
    up). Consequently, with a full runtime, `GSTeardown` is always the first point of entry for
    exiting the Dragon runtime, no matter the state of the various individual runtime services.

.. _halt-global-services:

Halt Global Services (A7-A9)
    Once Global Services receives the `GSTeardown`, it detaches from any channels. It does not
    destroy any (aside from any it may have directly created) because Local Services created the
    channels used for infrastructure and manages the memory created for them. Once it has shut down
    its resources, it transmits `GSHalted` to Local Services over `stdout` and exits. This is
    forwarded all the way to the Frontend.

.. _halt-transport:

Halt Transport Agent (A9-A12)
    After the Frontend receives `GSHalted`, it initiates teardown of all Backend Transport Agents
    via `SHHaltTA`, which is eventually routed to the transport agent from Local Services. Upon
    receipt, the transport agent cancels any outstanding work it has, sends `TAHalted` over its
    channel to Local Services (**NOTE: This should be over stdout**) and exits. Local Services
    forwards the exit to the Frontend via routing of a `TAHalted` message. The Frontend waits for
    receipt of N such messages before continuing (**NOTE: This can easily result in hung processes
    and should be addressed**).

.. _issue-shteardown:

Issue SHTeardown (A13-A16)
    This is the first step in the more carefully orchestrated destruction of the Backend. Once
    Local Services receives `SHTeardown`, it detaches from its logging channel and transmits a
    `SHHaltBE` to the Backend. After destroying its logging infrastructure, the Backend forwards
    the `SHHaltBE` to the Frontend. The Frontend waits until it receives N `SHHaltBE` messages
    before exiting (**NOTE: another potential hang situation**).

.. _issue-behalted:

Issue BEHalted (A17-A19)
    The Frontend sends `BEHalted` to all Backends. Once sent, there is no more direct communication
    between Backend and Frontend. The Frontend will simply confirm completion via exit codes from
    the workload manager. Once Local Services receives its `BEHalted`, it deallocates all its
    allocated memory and exits.

.. _shutdown-be-overlay:

Shutdown Backend and its overlay (A20-A25)
    After transmitting `BEHalted` to Local Services, the Backend issues `SHHaltTA` to its overlay
    transport agent. This triggers a teardown identical to that of the transport agent referenced
    in :ref:`halt-transport`. Once the overlay is down and Local Services is down, as measured by a
    simple `wait` on child process exits, the Backend deallocates its managed memory, including
    logging infrastructure, and exits.

.. _shutdown-fe-overlay:

Shutdown Frontend (A26-A30)
    After the workload manager returns the exit code for the Backend exit, the Frontend shuts down
    its overlay similarly to the Backend in :ref:`shutdown-be-overlay`. After waiting on the
    successful overlay exit, it similarly deallocates managed memory and exits. After the Frontend
    exits, the Dragon runtime is down.

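
Several of the steps above (see :ref:`many-to-one`, :ref:`halt-transport`, and
:ref:`issue-shteardown`) have the frontend block until one message has arrived from each of the N
backends, which is where a hang can occur if a node fails. One way to bound that wait is sketched
below; the `recv` callable is hypothetical and simply stands in for reading the frontend's
infrastructure channel with a per-call timeout:

.. code-block:: python
    :caption: **Illustrative bounded gather of one message per backend**

    import queue
    import time


    def gather_from_backends(recv, nnodes, timeout):
        """Collect one message per backend or raise once the deadline passes."""
        deadline = time.monotonic() + timeout
        messages = []
        while len(messages) < nnodes:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError(f"only {len(messages)}/{nnodes} backends responded")
            messages.append(recv(remaining))
        return messages


    # Example: two simulated backends reply over a local queue.
    inbox = queue.Queue()
    inbox.put("SHHaltBE from nid00004")
    inbox.put("SHHaltBE from nid00005")
    print(gather_from_backends(lambda t: inbox.get(timeout=t), nnodes=2, timeout=5.0))
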