dragon.telemetry.analysis

Dragon’s API to access telemetry data from user space

Classes

AnalysisClient

This is the main user interface to access telemetry data.

AnalysisServer

class AnalysisClient

Bases: object

This is the main user interface to access telemetry data. The client requests data from a server that is started when the telemetry component of the runtime is brought up. Multiple clients can request data from the server. The client API also has a method to reboot the runtime, which is useful when a user wants to remove a node from the runtime. The reboot causes the runtime to tear down and restart with the specified nodes removed.

Example usage:

import dragon
from dragon.telemetry import AnalysisClient, Telemetry

if __name__ == "__main__":
    dt = Telemetry()
    dac = AnalysisClient()
    dac.connect()

    # while doing some GPU work

    metrics = dac.get_metrics()

    # only analyze the data if the metric was actually collected
    if "DeviceUtilization" in metrics:
        data = dac.get_data("DeviceUtilization")

        # average the data points reported by each node
        averages = []
        for data_dict in data:
            avg = sum(data_dict["dps"].values()) / len(data_dict["dps"])
            averages.append(avg)

        worst = min(averages)
        worst_node_idx = averages.index(worst)
        node_to_remove = data[worst_node_idx]["tags"]["host"]

        print(f"Worst Node: {node_to_remove} with minimum average value = {worst}", flush=True)

        # restart the program with the worst performing node removed
        dac.reboot(exclude_hostnames=[node_to_remove])

    dt.finalize()

__init__()

connect(timeout: int = None) → None

A user is required to connect to the server before requesting data. By connecting, a user can add requests to the server’s request queue. A timeout can be provided to wait for the connection.

Parameters:

timeout (int, optional) – user-provided timeout for obtaining the server's request queue. Without a timeout this is a blocking call. Defaults to None

Raises:

RuntimeError – if the connection request cannot be completed in the allotted time.
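
The timeout behavior described above could be handled along these lines. This is a minimal sketch, not part of Dragon: the helper name try_connect and the default timeout value are hypothetical, and the units of the timeout are not stated in this reference, so the value is a placeholder. The client is passed in, and the helper relies only on the documented connect(timeout=...) call and its RuntimeError.

```python
def try_connect(client, timeout=30):
    """Connect `client` to the analysis server; return True on success.

    `client` is expected to be an AnalysisClient (or any object exposing
    the documented connect(timeout=...) method). The helper itself and the
    default timeout value are hypothetical, chosen for illustration.
    """
    try:
        client.connect(timeout=timeout)
    except RuntimeError as err:
        # Documented failure mode: the request did not complete in time.
        print(f"Could not connect to the analysis server: {err}", flush=True)
        return False
    return True
```

A caller might use the return value to decide whether to skip telemetry analysis entirely rather than block the rest of the program.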

get_data(metrics: str, start_time: int = None) → list

Gathers telemetry data from every node in the allocation for the given metric(s) after the specified start time.

Parameters:
  • metrics (str or list) – a metric or list of metrics to gather data for

  • start_time (int, optional) – the time after which the user wants data collected. If None (the default), the last five minutes of data are returned

Raises:
Returns:

a list containing dictionaries with the response from each node.

Return type:

list
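
As an illustration of working with the returned list, the sketch below reduces it to one average per host. It assumes each entry has the shape used in the example usage above, {"tags": {"host": ...}, "dps": {timestamp: value, ...}}, and that a single metric was requested so each host appears once. The function name average_by_host is hypothetical and not part of Dragon.

```python
def average_by_host(data):
    """Map each host name to the mean of its data points.

    Assumes every entry looks like
    {"tags": {"host": "<name>"}, "dps": {<timestamp>: <value>, ...}}.
    Entries with no data points are skipped to avoid dividing by zero.
    """
    averages = {}
    for entry in data:
        points = entry["dps"]
        if not points:
            continue
        averages[entry["tags"]["host"]] = sum(points.values()) / len(points)
    return averages
```

Keying the result by host name, rather than by list position, avoids having to map an index back to the matching entry afterwards.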

get_metrics() → list

Returns all of the metrics that have been collected on any node.

Raises:

RuntimeError – raised if the user hasn’t connected to the server

Returns:

a list of all metrics that were found

Return type:

list
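
Because a metric may not have been collected on any node, it can help to check for it before requesting data. The following is a small sketch under that assumption; missing_metrics is a hypothetical helper, not part of Dragon, and it uses only the documented get_metrics() call on an already connected client.

```python
def missing_metrics(client, wanted):
    """Return the names in `wanted` that the server has not collected.

    `client` is expected to be a connected AnalysisClient (or any object
    exposing the documented get_metrics() method).
    """
    available = set(client.get_metrics())
    return [name for name in wanted if name not in available]
```

An empty result means every requested metric is available and get_data can be called for each of them.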

reboot(*, exclude_huids: list = None, exclude_hostnames: list = None) → None

Reboots the entire runtime. The Dragon runtime begins tearing down immediately, so any code that runs after this call, whether or not it interacts with Dragon infrastructure, should not be expected to complete in an uncorrupted state.

Parameters:
  • exclude_huids (list of ints, optional) – List of huids to exclude when restarting the runtime, defaults to None

  • exclude_hostnames (list of strings, optional) – List of hostnames to exclude when restarting the runtime, defaults to None
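
A hedged sketch of how the exclusion list might be built from per-host averages follows. The helper reboot_without_worst and its averages argument (host name mapped to average metric value) are hypothetical, and treating the lowest average as the worst host follows the example usage above rather than any rule of the API. Only the documented reboot(exclude_hostnames=...) call is used, with the hostname wrapped in a list as the parameter description specifies.

```python
def reboot_without_worst(client, averages):
    """Reboot the runtime, excluding the host with the lowest average.

    `averages` maps host name -> average metric value (hypothetical input,
    for example computed from get_data() output). Returns the excluded
    host name, although code after reboot() should not be relied upon.
    """
    if len(averages) < 2:
        # Excluding the only host would leave nothing to restart on.
        raise ValueError("need at least two hosts to exclude one")
    worst_host = min(averages, key=averages.get)
    # exclude_hostnames is documented as a list of strings.
    client.reboot(exclude_hostnames=[worst_host])
    return worst_host
```

Since the runtime starts tearing down as soon as reboot is called, any logging or cleanup that must happen should be done before invoking a helper like this.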

class AnalysisServer

Bases: object

__init__(queue_dict, return_queue, channel_discovery, shutdown_event, nnodes)
run()