dragon.ai.inference.llm_engine.LLMInferenceEngine
- class LLMInferenceEngine[source]
Bases:
objectHandles LLM inference using vLLM in a tensor-parallel environment.
This class is responsible ONLY for LLM inference, completely separated from batching, guardrails, and other preprocessing logic.
- __init__(model_config: ModelConfig, batching_config: BatchingConfig, hostname: str , devices: List [int ])[source]
Initialize the LLM inference engine.
- Parameters:
model_config (ModelConfig) – Model configuration.
batching_config (BatchingConfig) – Batching configuration (used for
max_num_seqs).hostname (str ) – Current process hostname.
Methods
__init__(model_config, batching_config, ...)Initialize the LLM inference engine.
generate(prompts[, json_schemas])Generate responses for a batch of prompts.
Return the tokenizer from the underlying vLLM engine.
Initialize the vLLM model and sampling parameters.
shutdown()Shutdown the LLM engine and release resources.
- __init__(model_config: ModelConfig, batching_config: BatchingConfig, hostname: str , devices: List [int ])[source]
Initialize the LLM inference engine.
- Parameters:
model_config (ModelConfig) – Model configuration.
batching_config (BatchingConfig) – Batching configuration (used for
max_num_seqs).hostname (str ) – Current process hostname.
- initialize() None [source]
Initialize the vLLM model and sampling parameters.
This should be called within the worker process to avoid serialization issues with CUDA objects.
- generate(prompts: List [str ], json_schemas: List = None) Tuple [List [str ], Dict [str , float ]][source]
Generate responses for a batch of prompts.
- Parameters:
- Returns:
Tuple
(responses, metrics)whereresponsesis a list of generated strings andmetricsis a dictionary of performance metrics.- Return type:
- get_tokenizer()[source]
Return the tokenizer from the underlying vLLM engine.
Useful for callers that need to
apply_chat_template()before callinggenerate().