An example using Dragon Telemetry

The Telemetry feature enables users to monitor and collect data on critical factors such as system performance and resource utilization. Being able to visualize and gain insight into these metrics is crucial for optimizing resource allocation, identifying and diagnosing issues, and ensuring the efficiency of compute nodes. Telemetry also lets users add custom metrics tailored to their application's unique needs through a simple interface that we provide.

Why not Prometheus? Prometheus is an open-source monitoring tool that collects and stores metrics as time-series data, and it is widely used together with Grafana for telemetry services. Dragon Telemetry is instead implemented on top of the Dragon infrastructure, which helps Telemetry scale as the user application scales up. All off-node communication happens over Dragon, so the high-speed transport agent is used when applicable. This design avoids the use of third-party software and enables users to monitor both user-specific and Dragon-specific metrics.

The program below demonstrates how users can add custom metrics:

import dragon
import argparse
import multiprocessing
import numpy as np
import scipy.signal
import time
import itertools
from dragon.telemetry.telemetry import Telemetry as dragon_telem

def get_args():
    parser = argparse.ArgumentParser(description="Basic SciPy test")

    parser.add_argument("--num_workers", type=int, default=4,
                        help="number of workers")

    parser.add_argument("--iterations", type=int, default=10,
                        help="number of iterations to do")

    parser.add_argument("--burns", type=int, default=2,
                        help="number of iterations to burn/ignore in order to warm up")

    parser.add_argument("--dragon", action="store_true",
                        help="run with dragon objs")

    parser.add_argument("--size", type=int, default=1000,
                        help="size of the array")

    parser.add_argument("--mem", type=int, default=(512 * 1024 * 1024),
                        help="overall footprint of image dataset to process")

    parser.add_argument('--work_time', type=float, default=0.0,
                        help='how many seconds of compute per image')

    my_args = parser.parse_args()

    return my_args


def f(args):
    # Do some image processing
    image, random_filter, work_time = args
    elapsed = 0.
    start = time.perf_counter()
    last = None
    # Explicitly control compute time per image
    while elapsed < work_time:
        last = scipy.signal.convolve2d(image, random_filter)[::5, ::5]
        elapsed = time.perf_counter() - start

    return last


if __name__ == "__main__":
    args = get_args()
    # Initializes local telemetry object. This has to be done for each process that adds data
    dt = dragon_telem()
    # Start Dragon or base Multiprocessing
    if args.dragon:
        print("using dragon runtime")
        multiprocessing.set_start_method("dragon")
    else:
        print("using regular mp libs/calls")
        multiprocessing.set_start_method("spawn")

    # Image and filter generation
    image = np.zeros((args.size, args.size))
    nimages = int(float(args.mem) / float(image.size))
    print(f"Number of images: {nimages}", flush=True)
    images = []
    images.append(image)
    for j in range(nimages-1):
        images.append(np.zeros((args.size, args.size)))
    filters = [np.random.normal(size=(4, 4)) for _ in range(nimages)]

    num_cpus = args.num_workers
    print(f"Number of workers: {num_cpus}", flush=True)

    # Initialize the pool of workers
    start = time.perf_counter()
    pool = multiprocessing.Pool(num_cpus)
    pool_start_time = time.perf_counter() - start
    dt.add_data("pool_start_time", pool_start_time)

    # Main body of the computation
    times = []
    for i in range(args.iterations + args.burns):
        start = time.perf_counter()
        res = pool.map(f, zip(images, filters, itertools.repeat(float(args.work_time))))
        del res
        times.append(time.perf_counter() - start)
        # this will only be collected if the telemetry level is >= 3
        dt.add_data("iteration", i, telemetry_level=3)
        # this will be collected any time telemetry data is being collected
        dt.add_data("iteration_time", times[i])
        print(f"Time for iteration {i} is {times[i]} seconds", flush=True)

    pool.close()
    pool.join()
    # Print out the mean and standard deviation, excluding the burns iterations
    print(f"Average time: {round(np.mean(times[args.burns:]), 2)} second(s)")
    print(f"Standard deviation: {round(np.std(times[args.burns:]), 2)} second(s)")
    # This shuts down the telemetry collection and cleans up the workers and the node local telemetry databases.
    # This only needs to be called by one process.
    dt.finalize()

Telemetry Interface for User Application

We have exposed the following methods for adding user-generated data:

Method: add_data

Description: Inserts user-defined metrics into the node-local database. Currently, there is no way to write data to the same metric name from multiple processes on the same node and then visualize that data separated by process ID in Grafana. We plan to support this in the future.
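
The telemetry_level argument used in the example program can be illustrated with a small stand-in class. This is only a sketch and not Dragon's implementation: LevelFilter and its attributes are hypothetical, and the default level of 1 for a call that omits telemetry_level is an assumption based on the comment in the program above that such data is collected whenever telemetry data is being collected.

```python
class LevelFilter:
    """Hypothetical stand-in that mimics only the level check of add_data."""

    def __init__(self, run_level):
        # run_level plays the role of the value given to --telemetry-level.
        self.run_level = run_level
        self.collected = []

    def add_data(self, name, value, telemetry_level=1):
        # Assumption: a metric is kept only when the run-time level is at
        # least the level requested for that metric.
        if self.run_level >= telemetry_level:
            self.collected.append((name, value))


def record_iteration(recorder, i, seconds):
    # Same two calls as in the main loop of the example program.
    recorder.add_data("iteration", i, telemetry_level=3)
    recorder.add_data("iteration_time", seconds)


level2 = LevelFilter(run_level=2)
level3 = LevelFilter(run_level=3)
record_iteration(level2, 0, 39.59)
record_iteration(level3, 0, 39.59)

print([name for name, _ in level2.collected])  # only iteration_time
print([name for name, _ in level3.collected])  # iteration and iteration_time
```

Under these assumptions, a run at telemetry level 2 would record iteration_time but not iteration, while a run at level 3 would record both.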

Method: finalize

Description: Indicates that the user application has finished running and that Telemetry services can be shut down.

NOTE: If this method is not called, Telemetry will not receive the message telling it to shut down, and the user will have to stop the application manually (for example, with Ctrl+C or by sending SIGTERM).
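
One way to make sure finalize is reached even when the application raises an exception is a try/finally block. The sketch below uses a hypothetical FakeTelemetry class that only logs calls, so that it runs without Dragon installed. Whether this pattern suits a given application is an assumption, and run_with_telemetry and the metric name work_result are illustrative names.

```python
class FakeTelemetry:
    """Hypothetical stand-in for the telemetry object; it only logs calls."""

    def __init__(self):
        self.calls = []

    def add_data(self, name, value):
        self.calls.append(("add_data", name, value))

    def finalize(self):
        self.calls.append(("finalize",))


def run_with_telemetry(dt, work):
    """Run work(), record its result, and always call dt.finalize()."""
    try:
        result = work()
        dt.add_data("work_result", result)
        return result
    finally:
        # Reached on success and on failure, so shutdown is always signalled.
        dt.finalize()


def failing_work():
    raise RuntimeError("simulated failure")


dt = FakeTelemetry()
try:
    run_with_telemetry(dt, failing_work)
except RuntimeError:
    pass

print(dt.calls)  # finalize was logged even though the work failed
```

In a real application, the stand-in would be replaced by the telemetry object created in the example program, and finalize would still be called by only one process.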

Installation

After installing dragon, the only other dependency that needs to be manually installed is the user’s Grafana server.

Grafana can be downloaded from the Grafana website. We suggest following the instructions provided by Grafana to install it locally. We recommend Grafana v10.4.x at this time.

We have created a custom config YAML for the Grafana OpenTSDB connection setup, which can be found in /dragon/telemetry/imports/custom.yaml. Place this file in grafana/conf/provisioning/datasources under your Grafana installation. Do not replace default.yml.

We have also provided a custom dashboard that can be found in /dragon/telemetry/imports/Grafana_DragonTelemetryDashboard.json

For convenience, those files are shown below:

Config YAML for Grafana OpenTSDB Connection

# Configuration file version
apiVersion: 1


# # List of data sources to insert/update depending on what's
# # available in the database.
datasources:
  - name: OpenTSDB
    type: opentsdb
    url: http://localhost:4242
    isDefault: true
    jsonData:
      tsdbVersion: 3
    editable: true

Custom Dashboard JSON for Grafana

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 1,
  "links": [],
  "liveNow": true,
  "panels": [
    {
      "collapsed": true,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 8,
      "panels": [
        {
          "datasource": {
            "type": "opentsdb",
            "uid": "adg5rnop5kbggc"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisBorderShow": false,
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "insertNulls": false,
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 1
          },
          "id": 7,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "aggregator": "sum",
              "datasource": {
                "type": "opentsdb",
                "uid": "adg5rnop5kbggc"
              },
              "downsampleAggregator": "avg",
              "downsampleFillPolicy": "none",
              "filters": [
                {
                  "filter": "*",
                  "groupBy": false,
                  "tagk": "host",
                  "type": "wildcard"
                }
              ],
              "metric": "iteration_time",
              "refId": "A"
            }
          ],
          "title": "Iteration Time",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "opentsdb",
            "uid": "adg5rnop5kbggc"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisBorderShow": false,
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "insertNulls": false,
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 1
          },
          "id": 9,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "aggregator": "sum",
              "datasource": {
                "type": "opentsdb",
                "uid": "adg5rnop5kbggc"
              },
              "downsampleAggregator": "avg",
              "downsampleFillPolicy": "none",
              "filters": [
                {
                  "filter": "*",
                  "groupBy": false,
                  "tagk": "host",
                  "type": "wildcard"
                }
              ],
              "metric": "iteration",
              "refId": "A"
            },
            {
              "aggregator": "sum",
              "datasource": {
                "type": "opentsdb",
                "uid": "adg5rnop5kbggc"
              },
              "downsampleAggregator": "avg",
              "downsampleFillPolicy": "none",
              "filters": [
                {
                  "filter": "*",
                  "groupBy": false,
                  "tagk": "host",
                  "type": "wildcard"
                }
              ],
              "hide": false,
              "metric": "cpu_percent",
              "refId": "B"
            }
          ],
          "title": "Iteration and CPU Utilization",
          "type": "timeseries"
        }
      ],
      "title": "Custom Metrics",
      "type": "row"
    },
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 1
      },
      "id": 5,
      "panels": [],
      "title": "CPU Metrics",
      "type": "row"
    },
    {
      "datasource": {
        "type": "opentsdb",
        "uid": "adg5rnop5kbggc"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 12,
        "w": 20,
        "x": 0,
        "y": 2
      },
      "id": 2,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "10.0.0",
      "targets": [
        {
          "aggregator": "sum",
          "datasource": {
            "type": "opentsdb",
            "uid": "adg5rnop5kbggc"
          },
          "downsampleAggregator": "avg",
          "downsampleFillPolicy": "none",
          "explicitTags": false,
          "filters": [
            {
              "filter": "*",
              "groupBy": false,
              "tagk": "host",
              "type": "wildcard"
            }
          ],
          "metric": "load_average",
          "refId": "A",
          "tags": {}
        }
      ],
      "title": "Load Average",
      "transformations": [
        {
          "id": "concatenate",
          "options": {
            "frameNameLabel": "frame",
            "frameNameMode": "field"
          }
        }
      ],
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "opentsdb",
        "uid": "adg5rnop5kbggc"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 12,
        "w": 20,
        "x": 0,
        "y": 14
      },
      "id": 3,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "10.0.0",
      "targets": [
        {
          "aggregator": "sum",
          "datasource": {
            "type": "opentsdb",
            "uid": "adg5rnop5kbggc"
          },
          "downsampleAggregator": "avg",
          "downsampleFillPolicy": "none",
          "explicitTags": false,
          "filters": [
            {
              "filter": "*",
              "groupBy": false,
              "tagk": "host",
              "type": "wildcard"
            }
          ],
          "metric": "cpu_percent",
          "refId": "A",
          "tags": {}
        }
      ],
      "title": "CPU Utilization",
      "transformations": [
        {
          "id": "concatenate",
          "options": {
            "frameNameLabel": "frame",
            "frameNameMode": "field"
          }
        }
      ],
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "opentsdb",
        "uid": "adg5rnop5kbggc"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 12,
        "w": 20,
        "x": 0,
        "y": 26
      },
      "id": 4,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "pluginVersion": "10.0.0",
      "targets": [
        {
          "aggregator": "sum",
          "datasource": {
            "type": "opentsdb",
            "uid": "adg5rnop5kbggc"
          },
          "downsampleAggregator": "avg",
          "downsampleFillPolicy": "none",
          "explicitTags": false,
          "filters": [
            {
              "filter": "*",
              "groupBy": false,
              "tagk": "host",
              "type": "wildcard"
            }
          ],
          "metric": "used_RAM",
          "refId": "A",
          "tags": {}
        }
      ],
      "title": "Memory Utilization",
      "transformations": [
        {
          "id": "concatenate",
          "options": {
            "frameNameLabel": "frame",
            "frameNameMode": "field"
          }
        }
      ],
      "type": "timeseries"
    }
  ],
  "refresh": "10s",
  "schemaVersion": 39,
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-5m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Telemetry - OpenTSDB",
  "uid": "f9ce5d3d-7738-4a3b-aa36-e954f5757865",
  "version": 10,
  "weekStart": ""
}

Steps to import:

1. Go to the Dashboard section in Grafana.
2. Click on the New dropdown in the top right corner and select Import.
3. Upload the custom JSON config.
4. Click on Import. You should see the dashboard.
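
Before importing, it can be useful to see which metric names the dashboard panels query, so that they can be compared with the names passed to add_data. The following is a minimal sketch: dashboard_metrics is a hypothetical helper, and the inline sample is a heavily trimmed stand-in for the dashboard JSON shown above that keeps only the fields the helper reads.

```python
import json


def dashboard_metrics(dashboard):
    """Return the sorted metric names queried by all panels, including
    panels nested inside collapsed rows."""
    found = set()
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        # Collapsed rows keep their child panels in a nested "panels" list.
        stack.extend(panel.get("panels", []))
        for target in panel.get("targets", []):
            if "metric" in target:
                found.add(target["metric"])
    return sorted(found)


# Trimmed sample with the same nesting as the dashboard above.
sample = json.loads("""
{
  "panels": [
    {"type": "row", "title": "Custom Metrics", "panels": [
      {"title": "Iteration Time",
       "targets": [{"metric": "iteration_time", "refId": "A"}]}
    ]},
    {"type": "row", "title": "CPU Metrics", "panels": []},
    {"title": "Load Average",
     "targets": [{"metric": "load_average", "refId": "A"}]}
  ]
}
""")

metrics = dashboard_metrics(sample)
print(metrics)
```

To inspect the full dashboard instead, load Grafana_DragonTelemetryDashboard.json with json.load and pass the resulting dictionary to the same function.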

Description of the system used

For this example, an HPE Cray EX was used. Each node has AMD EPYC 7763 64-core CPUs.

How to run

There are two instances where port forwarding is required. These two port forwards together give Grafana access to the Aggregator to make requests for data.

Aggregator
- Compute node to login node
- The command for this will be printed in the output when the application is run with the telemetry flag
- Example - ssh command: ssh -NL localhost:4242:localhost:4242 pinoak0027
- NOTE: Run this step only after the command has been printed for you.
Grafana to Aggregator
- This step may vary depending on where Grafana server is installed and running
- If Grafana is on localhost (laptop), run the following on the laptop:
  
  ssh -NL localhost:4242:localhost:4242 <user>@<login-node>
- NOTE: This can be set up at any time and left running. It gives Grafana access to the Dragon implementation of the OpenTSDB datasource. Users should still use the default http://localhost:3000 to open Grafana in their browser.

Note that what we describe here assumes that the user can ssh to a specific login node and then ssh from there to a specific compute node. If users are required to jump through extra proxies, then extra tunnels will need to be set up. If there are multiple login nodes, make sure that both tunnels use the same login node. If the login node is chosen non-deterministically, using ssh -NR from the login node back to the proxy is a potential solution.
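
The two tunnels can also be written down programmatically, which makes it easier to keep the port and the login node consistent between them. The sketch below only builds the command lines and does not run ssh. The function name tunnel_command and the login target user@login1 are hypothetical, while port 4242 and the compute node name pinoak0027 are taken from the examples in this section.

```python
import shlex


def tunnel_command(target, port=4242):
    """Build the argument list for a local port forward to `target`.

    Mirrors the form used in this section:
    ssh -NL localhost:<port>:localhost:<port> <target>
    """
    forward = f"localhost:{port}:localhost:{port}"
    return ["ssh", "-NL", forward, target]


# Tunnel 1: run on the login node, pointing at the compute node.
login_to_compute = tunnel_command("pinoak0027")

# Tunnel 2: run on the laptop, pointing at the same login node
# (user@login1 is a placeholder).
laptop_to_login = tunnel_command("user@login1")

print(shlex.join(login_to_compute))
print(shlex.join(laptop_to_login))
```

The argument lists can be passed to subprocess.Popen if the tunnels should be started from a script rather than typed by hand.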

Example Output when run on 4 nodes with telemetry enabled

Listing 33 Output when running scipy_scale_work.py with telemetry enabled and some user generated metrics

> salloc -N 4
> dragon --telemetry-level=2 scipy_scale_work.py --dragon --num_workers 512 --iterations 4 --burns 1 --size 32 --mem 33554432 --work_time 0.5
ssh command: ssh -NL localhost:4242:localhost:4242 pinoak0027
using dragon runtime
Number of images: 32768
Number of workers: 512
Time for iteration 0 is 39.586153381998884 seconds
Time for iteration 1 is 34.648637460995815 seconds
Time for iteration 2 is 35.434623396999086 seconds
Time for iteration 3 is 33.31561650399817 seconds
Time for iteration 4 is 33.696790750007494 seconds
Average time: 34.27 second(s)
Standard deviation: 0.83 second(s)
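
The reported average and standard deviation can be reproduced from the five iteration times in the listing. The sketch below uses the standard library instead of NumPy; statistics.pstdev computes the population standard deviation, which matches the default behaviour of np.std used in the program. With --burns 1, the first (warm-up) iteration is excluded.

```python
import statistics

# Iteration times copied from the output listing above.
times = [
    39.586153381998884,
    34.648637460995815,
    35.434623396999086,
    33.31561650399817,
    33.696790750007494,
]
burns = 1  # value of --burns in the example run

# Drop the warm-up iterations, as the program does with times[args.burns:].
measured = times[burns:]
average = round(statistics.fmean(measured), 2)
std_dev = round(statistics.pstdev(measured), 2)

print(f"Average time: {average} second(s)")        # Average time: 34.27 second(s)
print(f"Standard deviation: {std_dev} second(s)")  # Standard deviation: 0.83 second(s)
```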

Troubleshooting

Listed below are some scenarios that we ran into and the steps we took to solve them.

Metrics aren’t showing up on Grafana dashboard panels
- You might encounter this the first time running Grafana and Telemetry
- Verify that Grafana is able to access Telemetry -
  
  Go to the Datasources tab in the navigation.
  
  Select OpenTSDB, scroll to the bottom and click on the Save and Test button.
  
  You should see a connection confirmation indicated by a green notification. If you don’t, double check ssh tunnels. Check for messages like: “open failed: connect failed: Connection refused”
- Refresh Grafana Panels individually
  
  Click on the Edit option of any panel.
  
  Click on the Datasource dropdown and re-select OpenTSDB
  
  Click on the Metric dropdown and re-type the metric name.
  
  Click on Apply (top-right corner)
  
  Repeat for all panels
- Save the Dashboard
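
When Save and Test fails, a quick way to narrow down which tunnel is broken is to check whether anything accepts connections on the forwarded port, first on the laptop and then on the login node. The following is a minimal sketch: port_is_open is a hypothetical helper and not part of Dragon, and it only confirms that a TCP connection can be established, not that the endpoint answers OpenTSDB requests.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# 4242 is the port used by the tunnels and the datasource URL in this section.
if port_is_open("localhost", 4242):
    print("localhost:4242 accepts connections")
else:
    print("nothing is reachable on localhost:4242; check the ssh tunnels")
```

If the check succeeds on the login node but fails on the laptop, the tunnel from the laptop to the login node is the likely culprit; if it fails on both, start with the tunnel from the login node to the compute node.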