dragon.ai.inference.llm_proxy.DragonQueueLLMProxy
- class DragonQueueLLMProxy[source]
Bases:
LLMProxyLLM proxy backed by a Dragon IPC queue.
Each
chat()call puts anInferenceRequeston input_queue with a per-call response queue drawn fromResponseQueuePool, then awaits until the response arrives.Concurrency is hard-limited by the pool size: if max_concurrent_requests calls are already in flight, subsequent callers await inside
ResponseQueuePool.acquire()until a response queue is returned — no overflow queues are ever created.Designed to be created per agent process — each agent owns its own proxy and response-queue pool, all pointing at the same shared inference pipeline via input_queue.
- Parameters:
input_queue (dragon.native.Queue) – Shared request queue consumed by the backend.
max_concurrent_requests (int ) – Hard limit on concurrent in-flight requests. Callers beyond this limit await until a slot frees. Defaults to
32.
Methods
__init__(input_queue, *[, ...])chat(messages[, tools, json_schema, ...])Send a chat request via Dragon Queue and return the response.
shutdown()Destroy all pooled response queues.
Attributes
Number of idle response queues available for immediate reuse.
- async chat(messages: List [Dict [str , Any ]], tools: List [Dict [str , Any ]] | None = None, json_schema: dict | None = None, continue_final_message: bool = False, *, sampling_params_override=None) str [source]
Send a chat request via Dragon Queue and return the response.
- Parameters:
messages (list [dict ]) – Conversation messages in OpenAI chat format.
json_schema (dict | None) – JSON schema dict for structured output. When provided, guided decoding is enabled.
continue_final_message (bool ) – Continue last assistant message.
sampling_params_override (SamplingParams | None) – Explicit
SamplingParamsoverride. Takes precedence over json_schema.
- Returns:
Response text.
- Return type:
- Raises:
Exception – Re-raises any exception returned by the backend.