An example using Dragon Telemetry
The Telemetry feature lets users monitor and collect data on critical factors such as system performance and resource utilization. Visualizing these metrics and gaining insight from them is crucial for optimizing resource allocation, identifying and diagnosing issues, and ensuring that compute nodes are used efficiently. Telemetry also lets users add custom metrics tailored to their application's needs through a simple interface.
Why not Prometheus? Prometheus is an open-source monitoring tool that collects and stores metrics as time-series data, and it is widely used together with Grafana for telemetry services. Dragon Telemetry is instead implemented on top of the Dragon infrastructure, which helps Telemetry scale as the user application scales up. All off-node communication happens over Dragon, so the high-speed transport agent is used when applicable. It avoids the use of third-party software and lets users monitor both user-specific and Dragon-specific metrics.
The program below demonstrates how users can add custom metrics:
import dragon
import argparse
import multiprocessing
import numpy as np
import scipy.signal
import time
import itertools

from dragon.telemetry.telemetry import Telemetry as dragon_telem


def get_args():
    parser = argparse.ArgumentParser(description="Basic SciPy test")
    parser.add_argument("--num_workers", type=int, default=4,
                        help="number of workers")
    parser.add_argument("--iterations", type=int, default=10,
                        help="number of iterations to do")
    parser.add_argument("--burns", type=int, default=2,
                        help="number of iterations to burn/ignore in order to warm up")
    parser.add_argument("--dragon", action="store_true",
                        help="run with dragon objs")
    parser.add_argument("--size", type=int, default=1000,
                        help="size of the array")
    parser.add_argument("--mem", type=int, default=(512 * 1024 * 1024),
                        help="overall footprint of image dataset to process")
    parser.add_argument('--work_time', type=float, default=0.0,
                        help='how many seconds of compute per image')
    my_args = parser.parse_args()
    return my_args


def f(args):
    # Do some image processing
    image, random_filter, work_time = args
    elapsed = 0.
    start = time.perf_counter()
    last = None
    # Explicitly control compute time per image
    while elapsed < work_time:
        last = scipy.signal.convolve2d(image, random_filter)[::5, ::5]
        elapsed = time.perf_counter() - start
    return last


if __name__ == "__main__":
    args = get_args()

    # Initializes local telemetry object. This has to be done for each process that adds data
    dt = dragon_telem()

    # Start Dragon or base Multiprocessing
    if args.dragon:
        print("using dragon runtime")
        multiprocessing.set_start_method("dragon")
    else:
        print("using regular mp libs/calls")
        multiprocessing.set_start_method("spawn")

    # Image and filter generation
    image = np.zeros((args.size, args.size))
    nimages = int(float(args.mem) / float(image.size))
    print(f"Number of images: {nimages}", flush=True)
    images = []
    images.append(image)
    for j in range(nimages-1):
        images.append(np.zeros((args.size, args.size)))
    filters = [np.random.normal(size=(4, 4)) for _ in range(nimages)]

    num_cpus = args.num_workers
    print(f"Number of workers: {num_cpus}", flush=True)

    # Initialize the pool of workers
    start = time.perf_counter()
    pool = multiprocessing.Pool(num_cpus)
    pool_start_time = time.perf_counter() - start
    dt.add_data("pool_start_time", pool_start_time)

    # Main body of the computation
    times = []
    for i in range(args.iterations + args.burns):
        start = time.perf_counter()
        res = pool.map(f, zip(images, filters, itertools.repeat(float(args.work_time))))
        del res
        times.append(time.perf_counter() - start)
        # this will only be collected if the telemetry level is >= 3
        dt.add_data("iteration", i, telemetry_level=3)
        # this will be collected any time telemetry data is being collected
        dt.add_data("iteration_time", times[i])
        print(f"Time for iteration {i} is {times[i]} seconds", flush=True)

    pool.close()
    pool.join()

    # Print out the mean and standard deviation, excluding the burns iterations
    print(f"Average time: {round(np.mean(times[args.burns:]), 2)} second(s)")
    print(f"Standard deviation: {round(np.std(times[args.burns:]), 2)} second(s)")

    # This shuts down the telemetry collection and cleans up the workers and the node local telemetry databases.
    # This only needs to be called by one process.
    dt.finalize()
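The comments in the program say that a metric added with telemetry_level=3 is only collected when the telemetry level is at least 3, while a metric added without a level is collected whenever telemetry is on. The sketch below illustrates that comparison with a hypothetical stand-in class. LevelFilteredRecorder, its runtime_level field, and the default level of 1 are assumptions made for illustration, not Dragon's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class LevelFilteredRecorder:
    """Hypothetical stand-in that mimics level gating; not Dragon's code."""

    runtime_level: int  # plays the role of the value passed to --telemetry-level
    samples: list = field(default_factory=list)

    def add_data(self, name, value, telemetry_level=1):
        # Assumption: a metric is kept when the runtime level is at least
        # the level requested for that metric, and the default level is 1.
        if self.runtime_level >= telemetry_level:
            self.samples.append((name, value))
            return True
        return False


rec = LevelFilteredRecorder(runtime_level=2)
kept_iteration = rec.add_data("iteration", 0, telemetry_level=3)  # below level 3: dropped
kept_time = rec.add_data("iteration_time", 39.59)                 # default level: kept
print(kept_iteration, kept_time, rec.samples)
```

Under these assumptions, a run started with a telemetry level of 2 would record iteration_time but not iteration.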
Telemetry Interface for User Application
We have exposed the following methods for adding user-generated data:
Method: add_data
Description: Inserts user-defined metrics into the node-local database. Currently there is no way to write data to the same metric name from multiple processes on the same node and then visualize that data separated by process ID in Grafana. We plan to support this in the future.
Method: finalize
Description: Indicates that the user application has finished running and that Telemetry services can be shut down.
NOTE: If this method is not called, Telemetry will not receive the message that it has to shut down, and the user will have to send SIGTERM or press Ctrl+C to exit.
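Because a missed finalize call leaves Telemetry running, one option is to wrap the telemetry object in a context manager so that finalize is reached even when the application raises. This is a minimal sketch, not part of the Dragon API: telemetry_session is a hypothetical helper, and FakeTelemetry is a stand-in used only so the example runs without Dragon installed.

```python
from contextlib import contextmanager


@contextmanager
def telemetry_session(factory):
    """Create a telemetry object and always call finalize() on exit."""
    dt = factory()
    try:
        yield dt
    finally:
        dt.finalize()


class FakeTelemetry:
    """Hypothetical stand-in exposing only add_data and finalize."""

    def __init__(self):
        self.data = []
        self.finalized = False

    def add_data(self, name, value):
        self.data.append((name, value))

    def finalize(self):
        self.finalized = True


captured = None
try:
    with telemetry_session(FakeTelemetry) as dt:
        captured = dt
        dt.add_data("pool_start_time", 0.42)
        raise RuntimeError("simulated failure in the application")
except RuntimeError:
    pass

# finalize() ran even though the body raised.
print(captured.finalized)
```

With Dragon, the factory would presumably be the Telemetry class imported in the program above. Since finalize only needs to be called by one process, a wrapper like this belongs in the main process rather than in the workers.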
Installation
After installing Dragon, the only other dependency that needs to be installed manually is the user's Grafana server.
Grafana can be downloaded from the Grafana website. We suggest following the instructions provided by Grafana to install it locally. We recommend Grafana v10.4.x at this time.
We have created a custom config YAML for the Grafana OpenTSDB connection setup, which can be found in /dragon/telemetry/imports/custom.yaml
Place this file in grafana/conf/provisioning/datasources under the directory where you installed Grafana. Do not replace default.yml
We have also provided a custom dashboard, which can be found in /dragon/telemetry/imports/Grafana_DragonTelemetryDashboard.json
For convenience, those files are shown below:
Config YAML for Grafana OpenTSDB Connection
# Configuration file version
apiVersion: 1

# # List of data sources to insert/update depending on what's
# # available in the database.
datasources:
  - name: OpenTSDB
    type: opentsdb
    url: http://localhost:4242
    isDefault: true
    jsonData:
      tsdbVersion: 3
    editable: true
Custom Dashboard JSON for Grafana
{ "annotations": { "list": [ { "builtIn": 1, "datasource": { "type": "grafana", "uid": "-- Grafana --" }, "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": 1, "links": [], "liveNow": true, "panels": [ { "collapsed": true, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, "id": 8, "panels": [ { "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisBorderShow": false, "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "insertNulls": false, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 1 }, "id": 7, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "aggregator": "sum", "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "downsampleAggregator": "avg", "downsampleFillPolicy": "none", "filters": [ { "filter": "*", "groupBy": false, "tagk": "host", "type": "wildcard" } ], "metric": "iteration_time", "refId": "A" } ], "title": "Iteration Time", "type": "timeseries" }, { "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisBorderShow": false, 
"axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "insertNulls": false, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 1 }, "id": 9, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "targets": [ { "aggregator": "sum", "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "downsampleAggregator": "avg", "downsampleFillPolicy": "none", "filters": [ { "filter": "*", "groupBy": false, "tagk": "host", "type": "wildcard" } ], "metric": "iteration", "refId": "A" }, { "aggregator": "sum", "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "downsampleAggregator": "avg", "downsampleFillPolicy": "none", "filters": [ { "filter": "*", "groupBy": false, "tagk": "host", "type": "wildcard" } ], "hide": false, "metric": "cpu_percent", "refId": "B" } ], "title": "Iteration and CPU Utilization", "type": "timeseries" } ], "title": "Custom Metrics", "type": "row" }, { "collapsed": false, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 1 }, "id": 5, "panels": [], "title": "CPU Metrics", "type": "row" }, { "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "description": "", "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisBorderShow": false, "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", 
"barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "insertNulls": false, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 12, "w": 20, "x": 0, "y": 2 }, "id": 2, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "pluginVersion": "10.0.0", "targets": [ { "aggregator": "sum", "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "downsampleAggregator": "avg", "downsampleFillPolicy": "none", "explicitTags": false, "filters": [ { "filter": "*", "groupBy": false, "tagk": "host", "type": "wildcard" } ], "metric": "load_average", "refId": "A", "tags": {} } ], "title": "Load Average", "transformations": [ { "id": "concatenate", "options": { "frameNameLabel": "frame", "frameNameMode": "field" } } ], "type": "timeseries" }, { "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "description": "", "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisBorderShow": false, "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "insertNulls": false, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": 
[], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 12, "w": 20, "x": 0, "y": 14 }, "id": 3, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "pluginVersion": "10.0.0", "targets": [ { "aggregator": "sum", "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "downsampleAggregator": "avg", "downsampleFillPolicy": "none", "explicitTags": false, "filters": [ { "filter": "*", "groupBy": false, "tagk": "host", "type": "wildcard" } ], "metric": "cpu_percent", "refId": "A", "tags": {} } ], "title": "CPU Utilization", "transformations": [ { "id": "concatenate", "options": { "frameNameLabel": "frame", "frameNameMode": "field" } } ], "type": "timeseries" }, { "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "description": "", "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisBorderShow": false, "axisCenteredZero": false, "axisColorMode": "text", "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "insertNulls": false, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 12, "w": 20, "x": 0, "y": 26 }, "id": 4, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "single", "sort": "none" } }, "pluginVersion": "10.0.0", "targets": [ { 
"aggregator": "sum", "datasource": { "type": "opentsdb", "uid": "adg5rnop5kbggc" }, "downsampleAggregator": "avg", "downsampleFillPolicy": "none", "explicitTags": false, "filters": [ { "filter": "*", "groupBy": false, "tagk": "host", "type": "wildcard" } ], "metric": "used_RAM", "refId": "A", "tags": {} } ], "title": "Memory Utilization", "transformations": [ { "id": "concatenate", "options": { "frameNameLabel": "frame", "frameNameMode": "field" } } ], "type": "timeseries" } ], "refresh": "10s", "schemaVersion": 39, "tags": [], "templating": { "list": [] }, "time": { "from": "now-5m", "to": "now" }, "timepicker": {}, "timezone": "", "title": "Telemetry - OpenTSDB", "uid": "f9ce5d3d-7738-4a3b-aa36-e954f5757865", "version": 10, "weekStart": "" }
Steps to import the dashboard:
1. Go to the Dashboard section in Grafana.
2. Click on the New dropdown in the top right corner and select Import.
3. Upload the custom JSON config.
4. Click on Import. You should see the dashboard.
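Before or after importing, it can help to see which metric each panel queries, for example to confirm that a custom metric name used in add_data matches a panel. The sketch below walks a dashboard dictionary and lists panel titles with their metrics. The helper name iter_targets and the trimmed sample are illustrative assumptions; with the real file you would load Grafana_DragonTelemetryDashboard.json with json.load instead.

```python
import json


def iter_targets(panels):
    """Yield (panel title, metric) pairs, descending into row panels."""
    for panel in panels:
        for target in panel.get("targets", []):
            yield panel.get("title", ""), target.get("metric")
        # Collapsed rows keep their child panels in a nested "panels" list.
        yield from iter_targets(panel.get("panels", []))


# Trimmed-down sample with the same shape as the dashboard JSON above.
sample = json.loads("""
{
  "title": "Telemetry - OpenTSDB",
  "panels": [
    {"type": "row", "title": "Custom Metrics", "panels": [
      {"title": "Iteration Time", "targets": [{"metric": "iteration_time"}]}
    ]},
    {"title": "CPU Utilization", "targets": [{"metric": "cpu_percent"}]}
  ]
}
""")

metrics = list(iter_targets(sample["panels"]))
for title, metric in metrics:
    print(f"{title}: {metric}")
```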
Description of the system used
For this example, an HPE Cray EX was used. Each node has AMD EPYC 7763 64-core CPUs.
How to run
Port forwarding is required in two places. Together, these two port forwards give Grafana access to the Aggregator so that it can request data.
- Aggregator
Compute node to login node
The command for this will be printed in the output when the application is run with the telemetry flag
Example -
ssh command: ssh -NL localhost:4242:localhost:4242 pinoak0027
NOTE: Run this step only after the command has been printed for you.
- Grafana to Aggregator
This step may vary depending on where the Grafana server is installed and running.
- If Grafana is on localhost (laptop):
On laptop -
ssh -NL localhost:4242:localhost:4242 <user>@<login-node>
NOTE: This can be set up at any time and left running. It gives Grafana access to the Dragon implementation of the OpenTSDB datasource. Users should still use the default http://localhost:3000 to open Grafana in their browser.
Note that what we describe here assumes that the user can ssh to a specific login node and then ssh to the specific compute node. If users are required to jump through extra proxies, then extra tunnels will need to be set up. If there are multiple login nodes, make sure that both tunnels use the same login node. If the login node is chosen non-deterministically, using ssh -NR from the login node back to the proxy is a potential solution.
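The two forwards described above use the same port mapping and differ only in the target host. As a small sketch, the commands can be built from one helper. The function name tunnel_command is a hypothetical name chosen for illustration, the port 4242 is taken from the commands shown above, and the host names are the examples and placeholders from this page.

```python
def tunnel_command(target, port=4242):
    """Return the ssh argv for a local port forward of `port` to `target`."""
    forward = f"localhost:{port}:localhost:{port}"
    return ["ssh", "-NL", forward, target]


# Hop 1: run on the login node, forwarding to the compute node that
# hosts the Aggregator (the node name is printed by the Dragon run).
login_to_compute = tunnel_command("pinoak0027")

# Hop 2: run on the laptop, forwarding to the login node.
laptop_to_login = tunnel_command("<user>@<login-node>")

print(" ".join(login_to_compute))
print(" ".join(laptop_to_login))
```

Each list can be passed to subprocess.Popen if the tunnels are started from a script. Keeping the same port on both hops is what lets the Grafana datasource URL stay at http://localhost:4242.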
Example Output when run on 4 nodes with telemetry enabled
> salloc -N 4
> dragon --telemetry-level=2 scipy_scale_work.py --dragon --num_workers 512 --iterations 4 --burns 1 --size 32 --mem 33554432 --work_time 0.5
ssh command: ssh -NL localhost:4242:localhost:4242 pinoak0027
using dragon runtime
Number of images: 32768
Number of workers: 512
Time for iteration 0 is 39.586153381998884 seconds
Time for iteration 1 is 34.648637460995815 seconds
Time for iteration 2 is 35.434623396999086 seconds
Time for iteration 3 is 33.31561650399817 seconds
Time for iteration 4 is 33.696790750007494 seconds
Average time: 34.27 second(s)
Standard deviation: 0.83 second(s)
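The summary numbers in this output can be reproduced from the printed values. The run performs five iterations (4 plus 1 burn), the first is discarded, and the program uses np.std, which defaults to the population standard deviation. The sketch below repeats that arithmetic with the standard library only; the variable names are chosen for illustration.

```python
import statistics

# Iteration times copied from the example output above.
times = [
    39.586153381998884,
    34.648637460995815,
    35.434623396999086,
    33.31561650399817,
    33.696790750007494,
]
burns = 1

kept = times[burns:]                    # drop the warm-up iteration
mean_time = statistics.fmean(kept)
std_time = statistics.pstdev(kept)      # population std, like np.std's default

# The image count is --mem divided by the elements per image (--size squared).
nimages = 33554432 // (32 * 32)

print(f"Average time: {round(mean_time, 2)} second(s)")
print(f"Standard deviation: {round(std_time, 2)} second(s)")
print(f"Number of images: {nimages}")
```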
Troubleshooting
Listed below are some scenarios that we ran into and the steps we took to solve them.
- Metrics aren’t showing up on Grafana dashboard panels
You might encounter this the first time running Grafana and Telemetry
- Verify that Grafana is able to access Telemetry -
Go to the Datasources tab in the navigation.
Select OpenTSDB, scroll to the bottom and click on the Save and Test button.
You should see a connection confirmation indicated by a green notification. If you don't, double-check the ssh tunnels and look for messages like: "open failed: connect failed: Connection refused"
- Refresh Grafana Panels individually
Click on the Edit option of any panel.
Click on the Datasource dropdown and re-select OpenTSDB
Click on the Metric dropdown and re-type the metric name.
Click on Apply (top-right corner)
Repeat for all panels
Save the Dashboard
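When checking the ssh tunnels, a quick first test is whether anything is listening on the forwarded port on the machine where Grafana runs. The sketch below attempts a TCP connection and reports the result. The function name port_is_open is a hypothetical helper, and the port 4242 is taken from the datasource URL shown earlier. A successful connection only shows that the local end of the tunnel accepts connections, not that the Aggregator is answering queries.

```python
import socket


def port_is_open(host="localhost", port=4242, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    state = "reachable" if port_is_open() else "not reachable"
    print(f"localhost:4242 is {state}")
```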