dragon.ai.inference.inference_utils.Inference
- class Inference
Bases: object
This class is the starting point for initializing the inference pipeline. It provides distributed multi-GPU and multi-node capabilities for low-latency, high-throughput GenAI inference. It also enables inference batching to optimize performance, and dynamic spin-up and spin-down of inference workers to reduce power consumption and carbon emissions.
- __init__(config: InferenceConfig, input_queue) → None
Initialize an Inference instance.
- Parameters:
config (InferenceConfig) – Type-safe configuration object.
input_queue (dragon.native.Queue) – Input queue that feeds user prompts into the backend service.
Methods
__init__(config, input_queue) – Initialize an Inference instance.
create_cpu_device_workers_by_node() – Create CPU-head workers and associated inference workers by node.
destroy() – Destroys all spun-up processes and terminates the application.
get_nodes_in_alloc() – Get all nodes in the current Dragon-enabled allocation.
Initializes the backend services to spin up the GenAI inference application.
maybe_subset_nodes_gpus() – If the number of nodes and GPUs specified in config.yaml differs from the default (full utilization), subset the nodes and GPUs accordingly.
query(q_item) – Queries the dragon-inference application to generate a response from the GenAI model.
tp_args_validator(app_gpus) – Validate tensor-parallel size against GPUs available per node.
- get_nodes_in_alloc()
Get all nodes in the current Dragon-enabled allocation.
- Returns:
Dictionary of all available nodes in the allocation. Keys are hostnames, values are dragon.native.machine.Node objects.
- Return type:
dict
- maybe_subset_nodes_gpus()
If the number of nodes and GPUs specified in config.yaml differs from the default (full utilization), subset the nodes and GPUs accordingly.
- Returns:
Dictionary mapping (hostname, Dragon Node) tuples to GPU ranks.
- Return type:
dict
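The subsetting described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name `subset_nodes_gpus`, its parameters, and the convention that `-1` means "use everything" are assumptions made for illustration, and plain strings stand in for dragon.native.machine.Node objects.

```python
from typing import Dict, Hashable, List, Tuple


def subset_nodes_gpus(
    nodes: Dict[str, Hashable],
    gpus_per_node: int,
    num_nodes: int = -1,
    num_gpus: int = -1,
) -> Dict[Tuple[str, Hashable], List[int]]:
    """Keep the first num_nodes nodes and the first num_gpus GPU ranks on each.

    Hypothetical helper: -1 is an assumed sentinel for "full utilization".
    """
    if num_nodes == -1:
        num_nodes = len(nodes)
    if num_gpus == -1:
        num_gpus = gpus_per_node
    if not 0 < num_nodes <= len(nodes):
        raise ValueError(f"num_nodes must be between 1 and {len(nodes)}")
    if not 0 < num_gpus <= gpus_per_node:
        raise ValueError(f"num_gpus must be between 1 and {gpus_per_node}")

    # Dicts preserve insertion order, so "first" means allocation order here.
    selected = list(nodes.items())[:num_nodes]
    return {(hostname, node): list(range(num_gpus)) for hostname, node in selected}


# Strings stand in for Node objects in this sketch.
alloc = {"node-a": "NodeA", "node-b": "NodeB", "node-c": "NodeC"}
mapping = subset_nodes_gpus(alloc, gpus_per_node=4, num_nodes=2, num_gpus=2)
```

With these inputs, `mapping` holds two keys, `("node-a", "NodeA")` and `("node-b", "NodeB")`, each mapped to GPU ranks `[0, 1]`.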
- tp_args_validator(app_gpus)
Validate tensor-parallel size against GPUs available per node.
- Parameters:
app_gpus (int) – Number of GPUs available in each node.
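The exact rules this method applies are not stated here. As a rough sketch of what such a check commonly looks like, the hypothetical function below assumes that a tensor-parallel group must fit within one node and that the per-node GPU count must divide evenly into groups. Both rules, and the name `validate_tp_size`, are assumptions.

```python
def validate_tp_size(tp_size: int, app_gpus: int) -> int:
    """Return how many tensor-parallel groups fit on one node.

    Hypothetical check, not the library's own: raises ValueError when
    tp_size is not positive, exceeds app_gpus, or does not divide app_gpus.
    """
    if tp_size < 1:
        raise ValueError("tensor-parallel size must be at least 1")
    if tp_size > app_gpus:
        raise ValueError(
            f"tensor-parallel size {tp_size} exceeds {app_gpus} GPUs per node"
        )
    if app_gpus % tp_size != 0:
        raise ValueError(
            f"{app_gpus} GPUs per node is not a multiple of tensor-parallel size {tp_size}"
        )
    return app_gpus // tp_size


# Four GPUs per node with tensor-parallel size 2 yields two groups per node.
groups_per_node = validate_tp_size(tp_size=2, app_gpus=4)
```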
- create_cpu_device_workers_by_node()
Create CPU-head workers and associated inference workers by node.
- query(q_item)
Queries the dragon-inference application to generate a response from the GenAI model.
- Parameters:
q_item (tuple) – Tuple of the form (user_input, response_queue), where user_input is a string or list of strings and response_queue is a dragon.native.Queue used to receive responses.
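The shape of q_item can be illustrated without a Dragon allocation. The sketch below uses the standard library's `queue.Queue` as a stand-in for dragon.native.Queue and a hypothetical `fake_backend` thread that echoes prompts back. It models only the (user_input, response_queue) tuple, not the real service or its response format; the `None` shutdown sentinel is also an assumption of the sketch.

```python
import queue
import threading


def fake_backend(input_queue: queue.Queue) -> None:
    """Stand-in for the inference service: echo each prompt to its response queue."""
    while True:
        q_item = input_queue.get()
        if q_item is None:  # assumed shutdown sentinel for this sketch only
            break
        user_input, response_queue = q_item
        # user_input may be a single string or a list of strings.
        prompts = [user_input] if isinstance(user_input, str) else list(user_input)
        for prompt in prompts:
            response_queue.put(f"echo: {prompt}")


input_queue = queue.Queue()
worker = threading.Thread(target=fake_backend, args=(input_queue,))
worker.start()

# Each q_item pairs the prompt(s) with the queue the caller will read from.
response_queue = queue.Queue()
input_queue.put(("What is Dragon?", response_queue))
input_queue.put((["first prompt", "second prompt"], response_queue))
input_queue.put(None)
worker.join()

responses = [response_queue.get() for _ in range(3)]
```

With a real Inference instance, the same tuple would be passed to query(q_item), and the caller would read results from its own response_queue.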