dragon.ai.inference.inference_utils.Inference

Bases: object

This class is the entry point for initializing the inference pipeline. It provides distributed multi-GPU and multi-node inference for low-latency, high-throughput GenAI workloads. It also supports inference batching to optimize performance, and dynamic spin-up and spin-down of inference workers to reduce power consumption and carbon emissions.

__init__(config: InferenceConfig, input_queue) → None

Initialize an Inference instance.

Parameters:

config (InferenceConfig) – Type-safe configuration object.
input_queue (dragon.native.Queue) – Input queue that feeds user prompts into the backend service.

Methods

`__init__`(config, input_queue)	Initialize an `Inference` instance.
`create_cpu_device_workers_by_node`()	Create CPU-head workers and associated inference workers by node.
`destroy`()	Destroys all spun-up processes and terminates the application.
`get_nodes_in_alloc`()	Get all nodes in the current Dragon-enabled allocation.
`initialize`()	Initializes the backend services to spin up the GenAI inference application.
`maybe_subset_nodes_gpus`()	If the number of nodes and GPUs specified in config.yaml differs from the default (full utilization), subset the nodes and GPUs accordingly.
`query`(q_item)	Queries the dragon-inference application to generate a response from the GenAI model.
`tp_args_validator`(app_gpus)	Validate tensor-parallel size against GPUs available per node.

get_nodes_in_alloc()

Get all nodes in the current Dragon-enabled allocation.

Returns: Dictionary of all available nodes in the allocation. Keys are hostnames, values are dragon.native.machine.Node objects.
Return type: dict

maybe_subset_nodes_gpus()

If the number of nodes and GPUs specified in config.yaml differs from the default (full utilization), subset the nodes and GPUs accordingly.

Returns: Tuple of (nodes_dict, gpus_per_node), where nodes_dict maps (hostname, Dragon Node) tuples to GPU ranks and gpus_per_node is the number of GPUs per node after subsetting.
Return type: tuple[dict, int]
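
The return shape described above can be illustrated without a Dragon runtime. The following is only a sketch under stated assumptions: `subset_nodes_gpus`, its parameters, and the rule of keeping the first hosts in sorted order are all hypothetical, and plain strings stand in for Dragon Node objects. The real method reads its limits from the configuration and may select nodes and GPUs differently.

```python
def subset_nodes_gpus(nodes, gpu_ranks_by_host, num_nodes=None, num_gpus=None):
    """Hypothetical illustration of subsetting nodes and GPUs.

    nodes:             dict mapping hostname -> node object (stand-in for a Dragon Node)
    gpu_ranks_by_host: dict mapping hostname -> list of GPU ranks on that host
    num_nodes/num_gpus: requested limits; None means full utilization (assumption)
    """
    hostnames = sorted(nodes)
    if num_nodes is not None:
        if not 1 <= num_nodes <= len(hostnames):
            raise ValueError("num_nodes must be between 1 and the allocation size")
        hostnames = hostnames[:num_nodes]

    # The largest GPU count that every selected host can provide.
    full = min(len(gpu_ranks_by_host[h]) for h in hostnames)
    gpus_per_node = full if num_gpus is None else num_gpus
    if not 1 <= gpus_per_node <= full:
        raise ValueError("num_gpus must be between 1 and the GPUs available per node")

    # Keys are (hostname, node) tuples, values are the GPU ranks kept on that node.
    nodes_dict = {
        (h, nodes[h]): gpu_ranks_by_host[h][:gpus_per_node] for h in hostnames
    }
    return nodes_dict, gpus_per_node


# Example with two 4-GPU hosts, restricted to one node and two GPUs.
nodes = {"node-b": "node-b-object", "node-a": "node-a-object"}
ranks = {"node-a": [0, 1, 2, 3], "node-b": [0, 1, 2, 3]}
nodes_dict, gpus_per_node = subset_nodes_gpus(nodes, ranks, num_nodes=1, num_gpus=2)
```

With these inputs the sketch keeps only `node-a` and its first two GPU ranks.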

tp_args_validator(app_gpus)

Validate tensor-parallel size against GPUs available per node.

Parameters: app_gpus (int) – Number of GPUs available in each node.
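
The exact rule applied by this method is not stated here. As a minimal sketch, assuming the tensor-parallel size must be a positive integer that does not exceed, and evenly divides, the number of GPUs per node (a common constraint, but an assumption in this context), such a check could look like the following. `validate_tp_size` and `tp_size` are hypothetical names, not part of the Dragon API.

```python
def validate_tp_size(tp_size: int, app_gpus: int) -> int:
    """Return how many tensor-parallel groups fit on one node.

    Hypothetical helper; the real tp_args_validator may apply different rules.
    """
    if tp_size < 1:
        raise ValueError("tensor-parallel size must be at least 1")
    if tp_size > app_gpus:
        raise ValueError(
            f"tensor-parallel size {tp_size} exceeds {app_gpus} GPUs per node"
        )
    if app_gpus % tp_size != 0:
        raise ValueError(
            f"tensor-parallel size {tp_size} does not evenly divide {app_gpus} GPUs"
        )
    return app_gpus // tp_size


groups_per_node = validate_tp_size(tp_size=2, app_gpus=4)
```

Failing early on an invalid size avoids spinning up workers that could never be placed on a node.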

create_cpu_device_workers_by_node()

Create CPU-head workers and associated inference workers by node.

Returns: Tuple (cpu_and_device_proc_by_hostname, num_cpu_procs), where cpu_and_device_proc_by_hostname maps hostnames to dictionaries of CPU-worker IDs and their inference worker configurations, and num_cpu_procs is the total number of CPU processes.
Return type: tuple[dict, int]

query(q_item)

Queries the dragon-inference application to generate a response from the GenAI model.

Parameters: q_item (tuple) – Tuple of the form (user_input, response_queue), where user_input is a string or list of strings and response_queue is a dragon.native.Queue used to receive responses.
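
The shape of `q_item` can be illustrated without a Dragon runtime. In the sketch below, the standard library's `queue.Queue` stands in for `dragon.native.Queue`, and `make_q_item` and `fake_backend` are hypothetical helpers written for this example. Neither reflects how the real backend processes prompts or how many responses it returns.

```python
import queue


def make_q_item(user_input, response_queue):
    """Build a (user_input, response_queue) tuple after checking its shape."""
    is_str = isinstance(user_input, str)
    is_str_list = isinstance(user_input, list) and all(
        isinstance(p, str) for p in user_input
    )
    if not (is_str or is_str_list):
        raise TypeError("user_input must be a string or a list of strings")
    return (user_input, response_queue)


def fake_backend(q_item):
    """Hypothetical stand-in for the service: echoes one reply per prompt."""
    user_input, response_queue = q_item
    prompts = [user_input] if isinstance(user_input, str) else user_input
    for prompt in prompts:
        response_queue.put(f"echo: {prompt}")


response_queue = queue.Queue()  # stand-in for dragon.native.Queue
q_item = make_q_item(["Hello", "What is Dragon?"], response_queue)
fake_backend(q_item)  # with a real Inference instance, q_item is passed to query()
replies = [response_queue.get() for _ in range(2)]
```

Passing the response queue inside each item lets every caller receive its replies on its own queue.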

initialize()[source] : Initializes the backend services to spin up the GenAI inference application.

destroy()[source] : Destroys all spun-up processes and terminates the application.