dragon.ai.inference.config.ModelConfig

class ModelConfig

Bases: object

LLM model and generation configuration.

Describes the model to load, tensor parallelism, tokenizer behavior, and default sampling parameters used by the vLLM backend.

Parameters:

model_name (str) – Hugging Face model name or local model directory.
hf_token (str) – Hugging Face token used when loading the model and tokenizer. A token string is required by the configuration even when the model is local.
tp_size (int) – Tensor-parallel size. Each inference worker consumes this many GPUs.
dtype (str) – Model precision passed to vLLM, such as "bfloat16" or "float16".
max_tokens (int) – Maximum number of new tokens to generate per request.
max_model_len (int) – Maximum model context length, including prompt and generated tokens.
padding_side (str) – Tokenizer padding side for prompt formatting.
truncation_side (str) – Tokenizer truncation side for prompt formatting.
top_k (int) – Number of highest-probability tokens kept for top-k sampling.
top_p (float) – Nucleus sampling threshold in the range [0.0, 1.0].
temperature (float) – Sampling temperature controlling randomness. 0.0 yields greedy decoding; higher values increase randomness.
repetition_penalty (float) – Penalty applied to previously generated tokens to discourage repetition. Values greater than 1.0 penalize repetition.
ignore_eos (bool) – If True, generation continues after the EOS token is produced instead of stopping.
skip_special_tokens (bool) – If True, special tokens are removed from the generated output text.
system_prompt (list[str]) – System instructions used by the direct dragon.ai.inference.Inference.query() path.
vllm_log_level (str) – vLLM logging level, for example "error" or "info".
gpu_memory_utilization (float) – Fraction of GPU memory vLLM uses for model weights and KV cache. Range is (0, 1].
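
The sampling-related parameters above share their names with keyword arguments of vLLM's SamplingParams. The following is a minimal sketch of how those defaults could be gathered into one set of keyword arguments. The helper name sampling_kwargs is hypothetical, and whether the Dragon backend builds its sampling parameters this way is an assumption, not something this page states.

```python
from typing import Any, Dict


def sampling_kwargs(
    max_tokens: int = 100,
    top_k: int = 50,
    top_p: float = 0.95,
    temperature: float = 0.5,
    repetition_penalty: float = 1.1,
    ignore_eos: bool = False,
    skip_special_tokens: bool = False,
) -> Dict[str, Any]:
    """Collect the documented sampling defaults into keyword arguments.

    Hypothetical helper, not part of Dragon. The keys match parameter
    names of vllm.SamplingParams, so where vLLM is installed the result
    could be passed as vllm.SamplingParams(**kwargs).
    """
    return {
        "max_tokens": max_tokens,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "repetition_penalty": repetition_penalty,
        "ignore_eos": ignore_eos,
        "skip_special_tokens": skip_special_tokens,
    }


# Greedy decoding with a longer completion budget; other values keep
# the defaults documented for ModelConfig.
kwargs = sampling_kwargs(temperature=0.0, max_tokens=256)
print(kwargs)
```

Only the fields a request needs to change are overridden here; everything else falls back to the documented defaults.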

__init__(model_name: str, hf_token: str, tp_size: int, dtype: str = 'bfloat16', max_tokens: int = 100, max_model_len: int = 8192, padding_side: str = 'left', truncation_side: str = 'left', top_k: int = 50, top_p: float = 0.95, temperature: float = 0.5, repetition_penalty: float = 1.1, ignore_eos: bool = False, skip_special_tokens: bool = False, system_prompt: List[str] = <factory>, vllm_log_level: str = 'error', gpu_memory_utilization: float = 0.95, use_async_streaming: bool = False) → None
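
To illustrate the signature, here is a sketch that uses a stand-in dataclass with the same field names and defaults, so it runs without Dragon installed. ModelConfigSketch is hypothetical; with Dragon available, the real class would presumably be imported from dragon.ai.inference.config instead. The system_prompt default is shown only as `<factory>` in the signature, so the empty list below is a placeholder assumption, as are the model path and token value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelConfigSketch:
    """Stand-in mirroring the documented ModelConfig fields (not the real class)."""

    # Required arguments: no defaults in the documented signature.
    model_name: str
    hf_token: str
    tp_size: int
    # Optional arguments with the documented defaults.
    dtype: str = "bfloat16"
    max_tokens: int = 100
    max_model_len: int = 8192
    padding_side: str = "left"
    truncation_side: str = "left"
    top_k: int = 50
    top_p: float = 0.95
    temperature: float = 0.5
    repetition_penalty: float = 1.1
    ignore_eos: bool = False
    skip_special_tokens: bool = False
    # The real factory's return value is not documented; an empty list
    # is used here purely as a placeholder.
    system_prompt: List[str] = field(default_factory=list)
    vllm_log_level: str = "error"
    gpu_memory_utilization: float = 0.95
    use_async_streaming: bool = False


cfg = ModelConfigSketch(
    model_name="/path/to/local/model",  # hypothetical local directory
    hf_token="hf_placeholder",          # placeholder; a string is still required
    tp_size=2,                          # each worker would use 2 GPUs
    temperature=0.0,                    # greedy decoding
    system_prompt=["You are a concise assistant."],
)
print(cfg.dtype, cfg.max_model_len, cfg.temperature)
```

Only the three required arguments must be supplied; every other field can be overridden by keyword as needed.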

Methods

`__init__`(model_name, hf_token, tp_size, ...)
`validate`(gpus_per_node)	Validate model configuration.

Attributes

`dtype`
`gpu_memory_utilization`
`ignore_eos`
`max_model_len`
`max_tokens`
`padding_side`
`repetition_penalty`
`skip_special_tokens`
`temperature`
`top_k`
`top_p`
`truncation_side`
`use_async_streaming`
`vllm_log_level`
`model_name`
`hf_token`
`tp_size`
`system_prompt`

model_name: str 

hf_token: str 

tp_size: int 

dtype: str = 'bfloat16'

max_tokens: int = 100

max_model_len: int = 8192

padding_side: str = 'left'

truncation_side: str = 'left'

top_k: int = 50

top_p: float = 0.95

temperature: float = 0.5

repetition_penalty: float = 1.1

ignore_eos: bool = False

skip_special_tokens: bool = False

system_prompt: List[str]

vllm_log_level: str = 'error'

gpu_memory_utilization: float = 0.95

use_async_streaming: bool = False

validate(gpus_per_node: int) → None

Validate model configuration.

Parameters: gpus_per_node (int) – Number of GPUs available per node.
Raises: ValueError – If any configuration parameter is invalid.
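
This page does not list the individual checks that validate performs. The sketch below shows the kind of checks the documented parameter ranges suggest. The function names check_config and workers_per_node are hypothetical, and the rule that tp_size must not exceed gpus_per_node is an assumption inferred from the statement that each inference worker consumes tp_size GPUs.

```python
def check_config(
    tp_size: int,
    gpus_per_node: int,
    top_p: float = 0.95,
    gpu_memory_utilization: float = 0.95,
) -> None:
    """Raise ValueError for values outside the documented ranges.

    Hypothetical stand-in for ModelConfig.validate; the real method may
    check more, fewer, or different conditions.
    """
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    # Assumption: a worker's tensor-parallel group must fit on one node.
    if not 1 <= tp_size <= gpus_per_node:
        raise ValueError(
            f"tp_size={tp_size} must be between 1 and gpus_per_node={gpus_per_node}"
        )
    # Documented range for top_p: [0.0, 1.0].
    if not 0.0 <= top_p <= 1.0:
        raise ValueError(f"top_p={top_p} must be in [0.0, 1.0]")
    # Documented range for gpu_memory_utilization: (0, 1].
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError(
            f"gpu_memory_utilization={gpu_memory_utilization} must be in (0, 1]"
        )


def workers_per_node(tp_size: int, gpus_per_node: int) -> int:
    """Number of tp_size-GPU workers that fit on one node (floor division)."""
    return gpus_per_node // tp_size


check_config(tp_size=2, gpus_per_node=4)       # passes silently
print(workers_per_node(tp_size=2, gpus_per_node=4))

try:
    check_config(tp_size=8, gpus_per_node=4)   # more GPUs than the node has
except ValueError as exc:
    print(f"invalid configuration: {exc}")
```

Calling the validation step before any model is loaded surfaces a bad configuration as a single ValueError instead of a later backend failure.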
