dragon.telemetry.analysis
Dragon’s API to access telemetry data from user space
Classes
AnalysisClient – The main user interface for accessing telemetry data.
- class AnalysisClient
Bases:
object
This is the main user interface for accessing telemetry data. The client requests data from a server that is started when the telemetry component of the runtime is brought up. Multiple clients can request data from the server. The client API also provides a method to reboot the runtime, which is useful when a user wants to remove a node from the runtime: the reboot tears the runtime down and restarts it with the specified nodes removed.
Example usage:
    import dragon
    from dragon.telemetry import AnalysisClient, Telemetry

    if __name__ == "__main__":
        dt = Telemetry()
        dac = AnalysisClient()
        dac.connect()

        # while doing some GPU work
        metrics = dac.get_metrics()
        if "DeviceUtilization" in metrics:
            data = dac.get_data("DeviceUtilization")

            # average of the collected data points for each node
            averages = []
            for data_dict in data:
                avg = sum(data_dict["dps"].values()) / len(data_dict["dps"])
                averages.append(avg)

            # the node with the lowest average is the worst performer
            worst = min(averages)
            worst_node_idx = averages.index(worst)
            node_to_remove = data[worst_node_idx]["tags"]["host"]
            print(f"Worst Node: {node_to_remove} with minimum average value = {worst}", flush=True)

            # restart the program with the worst performing node removed
            dac.reboot(exclude_hostnames=[node_to_remove])

        dt.finalize()
- __init__()
- connect(timeout: int = None) → None
A user is required to connect to the server before requesting data. By connecting, a user can add requests to the server’s request queue. A timeout can be provided to wait for the connection.
- Parameters:
timeout (int, optional) – user-provided timeout for obtaining the server’s request queue. Without a timeout, this is a blocking call. Defaults to None.
- Raises:
RuntimeError – raised if the connection request cannot be completed in the allotted time.
- get_data(metrics: str, start_time: int = None) → list
Gathers telemetry data from every node in the allocation for the given metric(s) after the specified start time.
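- Parameters:
metrics (str or list) – the metric, or list of metrics, to retrieve
start_time (int, optional) – only data collected after this time is returned, defaults to None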
- Raises:
RuntimeError – raised if the user hasn’t connected to the server
AttributeError – raised if the metric is neither a string nor a list
- Returns:
a list containing dictionaries with the response from each node.
- Return type:
list
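The sketch below shows one way to summarize a get_data() response. The response shape is an assumption inferred from the example usage above: each dictionary carries a "dps" mapping of data points and a "tags" dictionary with a "host" entry. The helper name average_per_host, the host names, the timestamps, and the values are all made up for illustration.

```python
def average_per_host(data):
    """Map each host to the mean of its data points; hosts without points are skipped."""
    averages = {}
    for node in data:
        dps = node["dps"]
        if not dps:
            continue  # avoid dividing by zero when a node reported nothing
        averages[node["tags"]["host"]] = sum(dps.values()) / len(dps)
    return averages


# Made-up sample in the assumed response shape.
sample = [
    {"tags": {"host": "node-a"}, "dps": {"100": 80.0, "101": 90.0}},
    {"tags": {"host": "node-b"}, "dps": {"100": 20.0, "101": 40.0}},
    {"tags": {"host": "node-c"}, "dps": {}},
]

averages = average_per_host(sample)
worst_host = min(averages, key=averages.get)
```

Keying the averages by host name, rather than by list position, keeps the result valid even when a node with no data points is skipped.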
- get_metrics() → list
Returns all of the metrics that have been collected on any node.
- Raises:
RuntimeError – raised if the user hasn’t connected to the server
- Returns:
a list of all metrics that were found
- Return type:
list
- reboot(*, exclude_huids: list = None, exclude_hostnames: list = None) → None
Calling this will reboot the entire runtime and cause the Dragon runtime to begin tearing down immediately. Any methods called after this, whether they interact with Dragon infrastructure or not, should not be expected to complete in an uncorrupted state.