dragon.ai.inference.inference_utils.Inference

class Inference[source]

Bases: object

This class is the entry point for initializing the inference pipeline. It provides distributed multi-GPU and multi-node inference for low-latency, high-throughput GenAI workloads. It also supports inference batching to improve performance, and dynamic spin-up and spin-down of inference workers to reduce power consumption and carbon emissions.

__init__(config: InferenceConfig, input_queue) → None[source]

Initialize an Inference instance.

Parameters:
  • config (InferenceConfig) – Type-safe configuration object.

  • input_queue (dragon.native.Queue) – Input queue that feeds user prompts into the backend service.

Methods

__init__(config, input_queue)

Initialize an Inference instance.

create_cpu_device_workers_by_node()

Create CPU-head workers and associated inference workers by node.

destroy()

Destroys all spun-up processes and terminates the application.

get_nodes_in_alloc()

Get all nodes in the current Dragon-enabled allocation.

initialize()

Initializes the backend services to spin up the GenAI inference application.

maybe_subset_nodes_gpus()

If the number of nodes and GPUs specified in config.yaml differs from the default (full utilization), subset the nodes and GPUs accordingly.

query(q_item)

Queries the dragon-inference application to generate a response from the GenAI model.

tp_args_validator(app_gpus)

Validate tensor-parallel size against GPUs available per node.

get_nodes_in_alloc()[source]

Get all nodes in the current Dragon-enabled allocation.

Returns:

Dictionary of all available nodes in the allocation. Keys are hostnames, values are dragon.native.machine.Node objects.

Return type:

dict

maybe_subset_nodes_gpus()[source]

If the number of nodes and GPUs specified in config.yaml differs from the default (full utilization), subset the nodes and GPUs accordingly.

Returns:

Dictionary mapping (hostname, Dragon Node) tuples to GPU ranks.

Return type:

dict
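
The shape of the returned dictionary can be illustrated with a minimal, self-contained sketch. It does not use the Dragon API: subset_nodes_gpus, its parameters, and the "first N hosts, first N GPU ranks" selection rule are all hypothetical, and plain strings stand in for dragon.native.machine.Node objects.

```python
from typing import Any, Dict, List, Optional, Tuple


def subset_nodes_gpus(
    nodes: Dict[str, Any],
    gpus_per_node: int,
    num_nodes: Optional[int] = None,
    num_gpus: Optional[int] = None,
) -> Dict[Tuple[str, Any], List[int]]:
    """Hypothetical helper: keep the first `num_nodes` hosts and the first
    `num_gpus` GPU ranks on each host. None means full utilization."""
    want_nodes = len(nodes) if num_nodes is None else num_nodes
    want_gpus = gpus_per_node if num_gpus is None else num_gpus
    if not 1 <= want_nodes <= len(nodes):
        raise ValueError(f"num_nodes must be in 1..{len(nodes)}, got {want_nodes}")
    if not 1 <= want_gpus <= gpus_per_node:
        raise ValueError(f"num_gpus must be in 1..{gpus_per_node}, got {want_gpus}")
    # Sort hostnames so the selection is deterministic (an assumption of this sketch).
    hostnames = sorted(nodes)[:want_nodes]
    return {(host, nodes[host]): list(range(want_gpus)) for host in hostnames}


# Strings stand in for Node objects here.
alloc = {"node-b": "NodeB", "node-a": "NodeA"}
subset = subset_nodes_gpus(alloc, gpus_per_node=4, num_nodes=1, num_gpus=2)
full = subset_nodes_gpus(alloc, gpus_per_node=4)
```

With these inputs, subset holds a single (hostname, node) key mapped to GPU ranks 0 and 1, while full keeps both hosts with all four ranks each.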

tp_args_validator(app_gpus)[source]

Validate tensor-parallel size against GPUs available per node.

Parameters:

app_gpus (int) – Number of GPUs available on each node.
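
As a rough illustration of what such a check might involve, the sketch below validates a tensor-parallel size against a per-node GPU count. The function name validate_tp_size, the tp_size parameter, and the rule that the size must evenly divide the GPU count are assumptions made for this example; the library's actual rules may differ.

```python
def validate_tp_size(tp_size: int, app_gpus: int) -> int:
    """Hypothetical check: return how many tensor-parallel groups fit on a node."""
    if tp_size < 1:
        raise ValueError(f"tp_size must be at least 1, got {tp_size}")
    if tp_size > app_gpus:
        raise ValueError(
            f"tp_size={tp_size} exceeds the {app_gpus} GPUs available per node"
        )
    # Assumed rule: groups must tile the node exactly, leaving no GPU unused.
    if app_gpus % tp_size != 0:
        raise ValueError(f"tp_size={tp_size} does not evenly divide {app_gpus} GPUs")
    return app_gpus // tp_size


groups_per_node = validate_tp_size(tp_size=2, app_gpus=8)
```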

create_cpu_device_workers_by_node()[source]

Create CPU-head workers and associated inference workers by node.

Returns:

Tuple (cpu_and_device_proc_by_hostname, num_cpu_procs) where cpu_and_device_proc_by_hostname maps hostnames to dictionaries of CPU-worker IDs and their inference worker configurations, and num_cpu_procs is the total number of CPU processes.

Return type:

tuple[dict, int]
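
To make the nested return structure concrete, here is a sketch that builds a value of the same general shape from plain Python data. Everything in it is hypothetical: the plan_workers name, the "gpu_ranks" configuration key, the globally increasing CPU-worker IDs, and the assumption that each CPU-head worker owns one tensor-parallel group of GPUs.

```python
from typing import Dict, List, Tuple


def plan_workers(
    gpus_by_hostname: Dict[str, List[int]], tp_size: int
) -> Tuple[Dict[str, Dict[int, dict]], int]:
    """Hypothetical planner: one CPU-head worker per group of `tp_size` GPUs."""
    plan: Dict[str, Dict[int, dict]] = {}
    num_cpu_procs = 0
    for hostname in sorted(gpus_by_hostname):
        gpus = gpus_by_hostname[hostname]
        per_host: Dict[int, dict] = {}
        # Walk the GPU list in full groups; leftover GPUs are ignored.
        for start in range(0, len(gpus) - tp_size + 1, tp_size):
            per_host[num_cpu_procs] = {"gpu_ranks": gpus[start:start + tp_size]}
            num_cpu_procs += 1
        plan[hostname] = per_host
    return plan, num_cpu_procs


cpu_and_device_proc_by_hostname, num_cpu_procs = plan_workers(
    {"node-a": [0, 1, 2, 3], "node-b": [0, 1, 2, 3]}, tp_size=2
)
```

In this example each host receives two CPU-head workers, so num_cpu_procs is 4 and the worker IDs run from 0 to 3 across both hosts.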

query(q_item)[source]

Queries the dragon-inference application to generate a response from the GenAI model.

Parameters:

q_item (tuple) – Tuple of the form (user_input, response_queue), where user_input is a string or a list of strings and response_queue is a dragon.native.Queue used to receive responses.
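
The (user_input, response_queue) form can be sketched without a Dragon runtime. In the example below, the standard-library queue.Queue stands in for dragon.native.Queue, and fake_query is a hypothetical stand-in for the real method that simply echoes each prompt. Whether the real service returns one response per prompt is an assumption of this sketch.

```python
import queue


def fake_query(q_item):
    """Stand-in for Inference.query: echo each prompt onto the response queue."""
    user_input, response_queue = q_item
    # Accept either a single string or a list of strings, as documented.
    prompts = [user_input] if isinstance(user_input, str) else list(user_input)
    for prompt in prompts:
        response_queue.put(f"echo: {prompt}")


response_queue = queue.Queue()  # stand-in for dragon.native.Queue

fake_query(("What is Dragon?", response_queue))                  # single prompt
fake_query((["first prompt", "second prompt"], response_queue))  # batch of prompts

responses = [response_queue.get() for _ in range(3)]
```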

initialize()[source]

Initializes the backend services to spin up the GenAI inference application.

destroy()[source]

Destroys all spun-up processes and terminates the application.