dragon.ai.inference.llm_engine.LLMInferenceEngine

class LLMInferenceEngine[source] 

Bases: object

Handles LLM inference using vLLM in a tensor-parallel environment.

This class is responsible ONLY for LLM inference, completely separated from batching, guardrails, and other preprocessing logic.

Engine Modes

The engine supports two operational modes controlled by the use_async_streaming config flag:

Synchronous Mode (use_async_streaming: false, default): Uses vLLM’s synchronous LLM engine. Best for batch workloads. Call generate() for single or batched requests.
Async Streaming Mode (use_async_streaming: true): Uses vLLM’s V1 AsyncLLM engine for token-by-token streaming. Best for interactive workloads requiring low latency. Call generate_stream() for streaming or generate_single() for non-streaming responses.

Note

Async streaming and batching are mutually exclusive. Configuration validation will reject use_async_streaming: true with input_batching.toggle_on: true.

Request Routing

use_async_streaming``\| ``stream		Method
false	false	`generate()`
false	true	Error
true	false	`generate_single()`
true	true	`generate_stream()`

__init__(model_config: ModelConfig, batching_config: BatchingConfig, hostname: str , devices: List [int ])[source] 

Initialize the LLM inference engine.

Parameters:

model_config (ModelConfig) – Model configuration.
batching_config (BatchingConfig) – Batching configuration (used for max_num_seqs).
hostname (str ) – Current process hostname.
devices (list [int ]) – List of GPU device IDs.

Methods

`__init__`(model_config, batching_config, ...)	Initialize the LLM inference engine.
`generate`(prompts[, json_schemas])	Generate responses for a batch of prompts.
`generate_single`(prompt[, json_schema])	Generate a complete response for a single prompt (non-streaming).
`generate_stream`(prompt[, json_schema])	Generate a streaming response for a single prompt.
`get_tokenizer`()	Return the tokenizer from the underlying vLLM engine.
`initialize`()	Initialize the vLLM model and sampling parameters.
`shutdown`()	Shutdown the LLM engine and release resources.

__init__(model_config: ModelConfig, batching_config: BatchingConfig, hostname: str , devices: List [int ])[source] 

Initialize the LLM inference engine.

Parameters:

model_config (ModelConfig) – Model configuration.
batching_config (BatchingConfig) – Batching configuration (used for max_num_seqs).
hostname (str ) – Current process hostname.
devices (list [int ]) – List of GPU device IDs.

initialize() → None [source] 

Initialize the vLLM model and sampling parameters.

This should be called within the worker process to avoid serialization issues with CUDA objects.

generate(prompts: List [str ], json_schemas: List = None) → Tuple [List [str ], Dict [str , float ]][source] 

Generate responses for a batch of prompts.

Parameters:

prompts (list [str ]) – List of formatted prompts.
json_schemas (list [dict | None] | None) – Per-prompt JSON schema for guided decoding. A list the same length as prompts where each element is either a dict (enable guided decoding for that prompt) or None (free-form generation). Pass None to use free-form generation for every prompt.

Returns:

Tuple (responses, metrics) where responses is a list of generated strings and metrics is a dictionary of performance metrics.

Return type:

tuple [list [str ], dict [str , float ]]

generate_single(prompt: str , json_schema: dict | None = None) → Tuple [str , Dict [str , float ]][source] 

Generate a complete response for a single prompt (non-streaming).

This method is used when the AsyncLLM engine is active but the HTTP request has stream=false. It runs the async generator to completion and returns the final response.

Parameters:

prompt (str ) – Formatted prompt string ready for the LLM.
json_schema (dict | None) – Optional JSON schema for guided decoding.

Returns:

Tuple (response, metrics) where response is the complete generated string and metrics is a dictionary of performance metrics.

Return type:

tuple [str , dict [str , float ]]

Raises:

RuntimeError – If AsyncLLM engine is not available.

generate_stream(prompt: str , json_schema: dict | None = None) → Iterator [StreamChunk][source] 

Generate a streaming response for a single prompt.

Yields StreamChunk objects as tokens are generated. The final chunk has is_finished=True and includes metrics.

Requires use_async_streaming=True in config to enable the AsyncLLM (V1 engine) backend. Streaming is not available when using the synchronous LLMEngine.

Streaming is single-request only (no batching) to ensure low latency token delivery.

Parameters:

prompt (str ) – Formatted prompt string ready for the LLM.
json_schema (dict | None) – Optional JSON schema for guided decoding.

Yields:

StreamChunk objects containing incremental text.

Return type:

Iterator[StreamChunk]

Raises:

RuntimeError – If initialize() has not been called or if AsyncLLM is not available.

get_tokenizer()[source] 

Return the tokenizer from the underlying vLLM engine.

The inference worker uses this to apply a model-specific chat template for requests that arrive as OpenAI-style message lists. The tokenizer is only available after initialize() has constructed the vLLM engine instance.

Returns:: Tokenizer owned by the vLLM engine.
Return type:: transformers.PreTrainedTokenizerBase
Raises:: RuntimeError – If initialize() has not been called.

shutdown() → None [source] : Shutdown the LLM engine and release resources.