Telemetry with Grafana

Dragon Telemetry enables users to monitor and collect data on critical factors such as system performance and resource utilization. Visualizations can be accessed in real time using Grafana. In this tutorial we walk through how to enable, view, and add custom metrics using Telemetry.

Running Telemetry

Below is a simple example that initializes a Pool of workers that perform some computation. This is the base example that we will build on throughout the remainder of this tutorial.

Listing 44 Telemetry Example
import dragon
import multiprocessing as mp
import time
import random

from dragon.telemetry.telemetry import Telemetry as dragon_telem

def worker(task_id):
    start_time = time.time()
    res = sum(i * i for i in range(task_id))
    end_time = time.time()
    duration = end_time - start_time
    return f"Task {task_id} completed in {duration:.2f} seconds."

def main():
    num_cpus = 4
    # Start Dragon
    mp.set_start_method("dragon")
    # Initialize the pool of workers
    pool = mp.Pool(num_cpus)

    tasks = [random.randint(900000000, 999999999) for _ in range(30)]
    start = time.time()

    results = pool.map(worker, tasks)
    end = time.time()
    total_duration = end - start
    for result in results:
        print(result, flush=True)

    print(f"\nTotal execution time: {total_duration:.2f} seconds.", flush=True)
    pool.close()
    pool.join()

if __name__ == "__main__":
    # Initializes the local telemetry object.
    dt = dragon_telem()

    main()

    # This shuts down telemetry collection and cleans up the workers and the
    # node-local telemetry databases. It only needs to be called by one process.
    dt.finalize()

Listing 45 Output running a Pool example with Telemetry enabled
> salloc -N 2 --exclusive
> dragon --telemetry-level=2 telemetry_tutorial.py
Port forwarded using:  ssh -NR localhost:34383:0.0.0.0:34383 pinoak-login1
Task 930680889 completed in 52.30 seconds.
Task 913000711 completed in 51.44 seconds.
Task 953078228 completed in 49.84 seconds.
Task 991416633 completed in 59.53 seconds.
Task 975347560 completed in 62.78 seconds.
Task 995407115 completed in 59.55 seconds.
Task 996281122 completed in 52.75 seconds.
Task 925282227 completed in 56.38 seconds.
Task 913398131 completed in 49.62 seconds.
Task 949008582 completed in 56.06 seconds.

Accessing Telemetry Metrics

Install Grafana

Grafana can be downloaded from the Grafana website. We suggest following the instructions provided by Grafana to install it locally. We recommend Grafana v10.4.x at this time.

We have created a custom config YAML for the Grafana OpenTSDB datasource connection, which can be found at https://github.com/DragonHPC/dragon/blob/main/src/dragon/telemetry/imports/custom.yaml. Place this file in grafana/conf/provisioning/datasources under your Grafana installation directory.

Do not replace default.yml
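If you are curious what such a provisioning file contains, a minimal sketch of a Grafana OpenTSDB datasource provisioning entry is shown below. The datasource name and URL here are placeholders, and the custom.yaml from the repository is authoritative; this is only to illustrate the general shape of the file.

```yaml
apiVersion: 1
datasources:
  # The name is a placeholder; it is what you select in the dashboard's
  # datasource dropdown later.
  - name: Dragon Telemetry
    type: opentsdb
    access: proxy
    # Assumes the aggregator is reachable on localhost:4242 via the
    # port forwards described in the next section.
    url: http://localhost:4242
    jsonData:
      tsdbResolution: 1
      tsdbVersion: 3
```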

We have also provided a custom dashboard that can be found at https://github.com/DragonHPC/dragon/blob/main/src/dragon/telemetry/imports/Grafana_DragonTelemetryDashboard.json. This dashboard can be imported into Grafana.

Port forwarding

There are two places where port forwarding is required. Together, these two port forwards give Grafana access to the Aggregator so it can request metrics.

  1. Aggregator
    • Compute node to login node

    • The command for this will be printed in the output when the application is run with the telemetry flag

    • Example - ssh command: ssh -NL localhost:4242:localhost:4242 pinoak0027

    • NOTE: Run this step only after the command has been printed for you.

  2. Grafana to Aggregator
    • This step may vary depending on where the Grafana server is running

    • If Grafana is on localhost (laptop):
      • On laptop - ssh -NL localhost:4242:localhost:4242 <user>@<login-node>

    • NOTE: This can be set up at any time and left running. It provides Grafana access to the Dragon implementation of the OpenTSDB datasource. Users should still use the default http://localhost:3000 to open Grafana in their browser.

../_images/grafana_port_forwarding.jpg

Grafana Dashboard

(Skip step 1 if you have added the custom files to the config directory.)

  1. Navigate to Home > Connections > Datasources. Click on Add New Datasource and select OpenTSDB.

    Fill in details as shown below. Click on Save & Test.

    ../_images/grafana_datasource.jpg
  2. Navigate to Home > Dashboards. Click on New > Import. Upload the custom Dashboard.

  3. Go to Home > Dashboards and select your dashboard. You should see something like the following:

    ../_images/grafana_dashboard.jpg
    • Select your datasource name from the dropdown [1]

    • These are CPU metrics [2]

    • Each row can be expanded for different telemetry levels. They have their own set of panels [3]

  4. Click on the top right corner of any panel to edit it.

    ../_images/grafana_panel.jpg

Troubleshooting

Sometimes metrics don’t show up on Grafana dashboard panels, particularly the first time you run Grafana and Telemetry. To resolve this:

  • Verify that Grafana can access Telemetry by clicking on Save & Test under the OpenTSDB datasource.

  • Refresh the Grafana panels individually:

  • Click on the Edit option of any panel.

  • Click on the Datasource dropdown and re-select OpenTSDB

  • Click on the Metric dropdown and re-type the metric name.

  • Click on Apply (top-right corner)

  • Repeat for all panels

  • Save the Dashboard
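If panels stay empty, it can also help to check whether the aggregator is reachable at all from the machine running Grafana. Below is a small sketch that assumes the Dragon aggregator implements OpenTSDB’s standard /api/suggest endpoint (which Grafana’s OpenTSDB datasource uses to populate the Metric dropdown) and has been forwarded to localhost:4242; both the URL and port are placeholders for your own setup.

```python
import json
from urllib import request, error

def list_metrics(base_url="http://localhost:4242", max_results=25):
    """Ask the OpenTSDB-style suggest endpoint for available metric names.

    Returns a list of metric names, or None if the aggregator is unreachable,
    which usually points at a missing or broken port forward.
    """
    url = f"{base_url}/api/suggest?type=metrics&max={max_results}"
    try:
        with request.urlopen(url, timeout=5) as resp:
            return json.loads(resp.read().decode())
    except (error.URLError, OSError):
        return None

if __name__ == "__main__":
    metrics = list_metrics()
    if metrics is None:
        print("Aggregator unreachable: check both ssh port forwards.")
    else:
        print("Available metrics:", metrics)
```

If this returns None, fix the port forwards before debugging anything inside Grafana itself.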

Custom Metrics

To add custom metrics, we initialize Dragon Telemetry and use the add_data method to add metrics. Using the previous code, we keep the main() function as is and make the following changes to the rest of the code.

Listing 46 Telemetry Custom Metrics
import dragon
import multiprocessing as mp
import time
import random

from dragon.telemetry.telemetry import Telemetry as dragon_telem

def worker(task_id):
    dt = dragon_telem()
    start_time = time.time()
    res = sum(i * i for i in range(task_id))
    end_time = time.time()
    duration = end_time - start_time
    # Add custom metric
    dt.add_data(ts_metric_name="duration", ts_data=duration, telemetry_level=2, tagk="task_id", tagv=task_id)
    return f"Task {task_id} completed in {duration:.2f} seconds."

def main():
    ...  # Same as the previous example

if __name__ == "__main__":
    # Initializes the local telemetry object.
    dt = dragon_telem()

    main()

    # This shuts down telemetry collection and cleans up the workers and the
    # node-local telemetry databases. It only needs to be called by one process.
    dt.finalize()

Telemetry levels

In the previous run we used the argument --telemetry-level=2. This means that metrics are collected for every telemetry level less than or equal to the given telemetry-level value.

Running with --telemetry-level=3 enables GPU metrics.

User-defined metrics are set to a telemetry level of 1 by default, but this can be changed, as demonstrated in the example. Running without the argument means that no telemetry data is collected: user-generated telemetry calls can remain in the program, but their data will not be collected.
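The level rule can be illustrated with a small standalone model. This is not Dragon’s implementation, and the metric names here are made up for the example; it only demonstrates the filtering behavior described above.

```python
def collected(points, configured_level):
    """Model of the rule: a point is kept only when its telemetry level
    is <= the level passed via --telemetry-level.

    configured_level=None models running without the flag: nothing is
    collected, even though the add_data() calls remain in the program.
    """
    if configured_level is None:
        return []
    return [p for p in points if p["telemetry_level"] <= configured_level]

points = [
    {"metric": "cpu_percent", "telemetry_level": 1},      # default level for user metrics
    {"metric": "duration", "telemetry_level": 2},         # like the custom metric above
    {"metric": "gpu_utilization", "telemetry_level": 3},  # GPU metrics require level 3
]

print([p["metric"] for p in collected(points, 2)])     # ['cpu_percent', 'duration']
print([p["metric"] for p in collected(points, None)])  # []
```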

Additional configurations