dragon.ai.inference.llm_engine.LLMInferenceEngine

class LLMInferenceEngine[source]

Bases: object

Handles LLM inference using vLLM in a tensor-parallel environment.

This class is responsible ONLY for LLM inference, completely separated from batching, guardrails, and other preprocessing logic.

__init__(model_config: ModelConfig, batching_config: BatchingConfig, hostname: str , devices: List [int ])[source]

Initialize the LLM inference engine.

Parameters:
  • model_config (ModelConfig) – Model configuration.

  • batching_config (BatchingConfig) – Batching configuration (used for max_num_seqs).

  • hostname (str ) – Current process hostname.

  • devices (list [int ]) – List of GPU device IDs.

Methods

__init__(model_config, batching_config, ...)

Initialize the LLM inference engine.

generate(prompts[, json_schemas])

Generate responses for a batch of prompts.

get_tokenizer()

Return the tokenizer from the underlying vLLM engine.

initialize()

Initialize the vLLM model and sampling parameters.

shutdown()

Shutdown the LLM engine and release resources.

__init__(model_config: ModelConfig, batching_config: BatchingConfig, hostname: str , devices: List [int ])[source]

Initialize the LLM inference engine.

Parameters:
  • model_config (ModelConfig) – Model configuration.

  • batching_config (BatchingConfig) – Batching configuration (used for max_num_seqs).

  • hostname (str ) – Current process hostname.

  • devices (list [int ]) – List of GPU device IDs.

initialize() None [source]

Initialize the vLLM model and sampling parameters.

This should be called within the worker process to avoid serialization issues with CUDA objects.

generate(prompts: List [str ], json_schemas: List = None) Tuple [List [str ], Dict [str , float ]][source]

Generate responses for a batch of prompts.

Parameters:
  • prompts (list [str ]) – List of formatted prompts.

  • json_schemas (list [dict | None] | None) – Per-prompt JSON schema for guided decoding. A list the same length as prompts where each element is either a dict (enable guided decoding for that prompt) or None (free-form generation). Pass None to use free-form generation for every prompt.

Returns:

Tuple (responses, metrics) where responses is a list of generated strings and metrics is a dictionary of performance metrics.

Return type:

tuple [list [str ], dict [str , float ]]

get_tokenizer()[source]

Return the tokenizer from the underlying vLLM engine.

Useful for callers that need to apply_chat_template() before calling generate().

shutdown() None [source]

Shutdown the LLM engine and release resources.