dragon.telemetry.analysis

Dragon’s API to access telemetry data from user space

Classes

`AnalysisClient`	This is the main user interface to access telemetry data.
`AnalysisServer`

class AnalysisClient

Bases: object

This is the main user interface for accessing telemetry data. The client requests data from a server that is started when the telemetry component of the runtime is brought up, and multiple clients can request data from the same server. The client API also provides a method to reboot the runtime, which is useful when a user wants to remove a node: the reboot tears the runtime down and restarts it with the specified nodes excluded.

Example usage:

import dragon
from dragon.telemetry import AnalysisClient, Telemetry


def main():
    dt = Telemetry()
    dac = AnalysisClient()
    dac.connect()

    # while doing some GPU work

    metrics = dac.get_metrics()

    # nothing to analyze if no node has reported this metric
    if "DeviceUtilization" not in metrics:
        dt.finalize()
        return

    data = dac.get_data("DeviceUtilization")

    # average the data points reported by each node
    averages = []
    for data_dict in data:
        avg = sum(data_dict["dps"].values()) / len(data_dict["dps"])
        averages.append(avg)

    # the node with the lowest average utilization is the worst performer
    worst = min(averages)
    worst_node_idx = averages.index(worst)
    node_to_remove = data[worst_node_idx]["tags"]["host"]

    print(f"Worst Node: {node_to_remove} with minimum average value = {worst}", flush=True)

    # restart the program with the worst performing node removed
    dac.reboot(exclude_hostnames=[node_to_remove])

    dt.finalize()


if __name__ == "__main__":
    main()

__init__()

connect(timeout: int = None) → None 

A user must connect to the server before requesting data. Connecting allows the user to add requests to the server’s request queue. A timeout can be provided to bound how long to wait for the connection.

Parameters: timeout (int, optional) – user-provided timeout for getting the server request queue. Defaults to None, in which case this is a blocking call.
Raises: RuntimeError – if the connection request cannot be completed in the allotted time.
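Because `connect` raises `RuntimeError` when the timeout expires, a caller may want to bound each wait and try again a few times. The following is a minimal sketch under that assumption, not part of Dragon: `connect_with_retry` and its `attempts` parameter are hypothetical names, and the helper relies only on the `connect(timeout=...)` behavior described above. It accepts any object with a compatible `connect` method.

```python
def connect_with_retry(client, timeout=10, attempts=3):
    """Call client.connect(timeout=timeout) up to `attempts` times.

    Hypothetical helper, not part of Dragon. Returns the number of the
    attempt that succeeded and re-raises the last RuntimeError if every
    attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            client.connect(timeout=timeout)
            return attempt
        except RuntimeError:
            # Give up only after the final attempt has failed.
            if attempt == attempts:
                raise
```

With an `AnalysisClient` instance named `dac`, this might be called as `connect_with_retry(dac, timeout=30)`. Whether retrying is appropriate depends on why the server's request queue was unavailable.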

get_data(metrics: str , start_time: int = None) → list 

Gathers telemetry data from every node in the allocation for the given metric(s) after the specified start time.

Parameters:

metrics (str or list) – a metric or list of metrics to gather data for
start_time (int, optional) – the time after which the user wants data collected. Defaults to None, in which case the last five minutes of data are returned

Raises:

RuntimeError – raised if the user hasn’t connected to the server
AttributeError – raised if the metric is neither a string nor list

Returns:

a list containing dictionaries with the response from each node.

Return type:

list
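The returned list can be reduced to one value per node. The sketch below assumes that each dictionary has the shape used in the `AnalysisClient` example, namely a `"dps"` mapping of data points and a `"tags"` mapping containing `"host"`, and that a single metric was requested so each host appears once. The helper name `average_by_host` and the sample response are hypothetical and are for illustration only.

```python
def average_by_host(data):
    """Return {hostname: mean of that node's data points}.

    Assumes each entry looks like {"tags": {"host": ...}, "dps": {key: value}}.
    Entries without any data points are skipped.
    """
    averages = {}
    for entry in data:
        points = entry["dps"]
        if not points:
            continue  # avoid dividing by zero for a node with no samples
        averages[entry["tags"]["host"]] = sum(points.values()) / len(points)
    return averages


# Hypothetical response for illustration; real keys and values may differ.
sample = [
    {"tags": {"host": "node-a"}, "dps": {"1": 80.0, "2": 90.0}},
    {"tags": {"host": "node-b"}, "dps": {"1": 20.0, "2": 40.0}},
    {"tags": {"host": "node-c"}, "dps": {}},
]
print(average_by_host(sample))
```

Keying the result by hostname rather than by list position avoids having to map an index back to a node afterwards.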

get_metrics() → list 

Returns all of the metrics that have been collected on any node.

Raises: RuntimeError – raised if the user hasn’t connected to the server
Returns: a list of all metrics that were found
Return type: list
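Since a metric appears in this list only once some node has collected it, it can help to check the list before requesting data. The following is a small sketch; `missing_metrics` is a hypothetical helper, and `"DevicePowerUsage"` is an illustrative metric name that may not exist on a given system.

```python
def missing_metrics(available, wanted):
    """Return the names in `wanted` that are absent from `available`, in order."""
    available = set(available)
    return [name for name in wanted if name not in available]


# `available` stands in for the list returned by get_metrics().
available = ["DeviceUtilization"]
print(missing_metrics(available, ["DeviceUtilization", "DevicePowerUsage"]))
```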

reboot(*, exclude_huids: list = None, exclude_hostnames: list = None) → None 

Calling this reboots the entire Dragon runtime, which begins tearing down immediately. Any code that runs after this call, whether or not it interacts with Dragon infrastructure, should not be expected to complete in an uncorrupted state.

Parameters:

exclude_huids (list of ints, optional) – List of huids to exclude when restarting the runtime, defaults to None
exclude_hostnames (list of strings, optional) – List of hostnames to exclude when restarting the runtime, defaults to None
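Both parameters are keyword-only and are documented as lists, so a single hostname or huid should be wrapped in a list before it is passed. The sketch below shows one way to normalize the arguments; `build_reboot_kwargs` is a hypothetical helper, not part of Dragon, and it only prepares arguments without calling `reboot` itself.

```python
def build_reboot_kwargs(hostnames=None, huids=None):
    """Normalize exclusions into keyword arguments for reboot().

    Hypothetical helper. A single hostname (str) or huid (int) is wrapped
    in a list; any other iterable is copied into a list.
    """
    kwargs = {}
    if hostnames is not None:
        # list("node-b") would split the string into characters, so wrap it.
        if isinstance(hostnames, str):
            kwargs["exclude_hostnames"] = [hostnames]
        else:
            kwargs["exclude_hostnames"] = list(hostnames)
    if huids is not None:
        if isinstance(huids, int):
            kwargs["exclude_huids"] = [huids]
        else:
            kwargs["exclude_huids"] = list(huids)
    return kwargs


print(build_reboot_kwargs(hostnames="node-b"))
```

With an `AnalysisClient` instance named `dac`, the result could be passed as `dac.reboot(**build_reboot_kwargs(hostnames="node-b"))`. Given the teardown behavior described above, that call is best placed as the last statement the program depends on.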

class AnalysisServer

Bases: object

__init__(queue_dict, return_queue, channel_discovery, shutdown_event, nnodes)

run()