dragon.ai.inference.llm_proxy.DragonQueueLLMProxy

class DragonQueueLLMProxy[source]

Bases: LLMProxy

LLM proxy backed by a Dragon IPC queue.

Each chat() call puts an InferenceRequest on input_queue with a per-call response queue drawn from ResponseQueuePool, then awaits until the response arrives.

Concurrency is hard-limited by the pool size: if max_concurrent_requests calls are already in flight, subsequent callers await inside ResponseQueuePool.acquire() until a response queue is returned — no overflow queues are ever created.

Designed to be created per agent process — each agent owns its own proxy and response-queue pool, all pointing at the same shared inference pipeline via input_queue.

Parameters:
  • input_queue (dragon.native.Queue) – Shared request queue consumed by the backend.

  • max_concurrent_requests (int ) – Hard limit on concurrent in-flight requests. Callers beyond this limit await until a slot frees. Defaults to 32.

__init__(input_queue, *, max_concurrent_requests: int = 32) None [source]

Methods

__init__(input_queue, *[, ...])

chat(messages[, tools, json_schema, ...])

Send a chat request via Dragon Queue and return the response.

shutdown()

Destroy all pooled response queues.

Attributes

pool_available

Number of idle response queues available for immediate reuse.

__init__(input_queue, *, max_concurrent_requests: int = 32) None [source]
property pool_available: int

Number of idle response queues available for immediate reuse.

async chat(messages: List [Dict [str , Any ]], tools: List [Dict [str , Any ]] | None = None, json_schema: dict | None = None, continue_final_message: bool = False, *, sampling_params_override=None) str [source]

Send a chat request via Dragon Queue and return the response.

Parameters:
  • messages (list [dict ]) – Conversation messages in OpenAI chat format.

  • tools (list [dict ] | None) – Optional tool definitions.

  • json_schema (dict | None) – JSON schema dict for structured output. When provided, guided decoding is enabled.

  • continue_final_message (bool ) – Continue last assistant message.

  • sampling_params_override (SamplingParams | None) – Explicit SamplingParams override. Takes precedence over json_schema.

Returns:

Response text.

Return type:

str

Raises:

Exception – Re-raises any exception returned by the backend.

async shutdown() None [source]

Destroy all pooled response queues.