Inference

The Dragon Inference module provides distributed, multi-GPU and multi-node LLM inference for low-latency, high-throughput generative AI workloads on HPC clusters. It features a pull-based distributed load balancing component managed through RDMA-enabled shared Dragon Queues. The module also incorporates dynamic batching of inference requests, optional prompt guardrails, and a tensor-parallelized vLLM backend integrated with Dragon’s process and communication primitives.

Note

This module is experimental and not yet in its final state. See the Dragon Inference Service - User Guide for installation and configuration instructions, Inference Service Examples for examples, and Inference Service for implementation details.

User code should import the main service, configuration dataclasses, and queue proxy from dragon.ai.inference. The submodule references below document the implementation modules that define those public objects.

Python Reference

Core

Entry point for initializing and launching the full inference pipeline across nodes and GPUs.

Inference

This class is the starting point for initializing the inference pipeline.

Configuration

Type-safe dataclasses covering hardware allocation, model parameters, batching, guardrails, dynamic worker management, and the top-level composite config.

`InferenceConfig`	Master configuration for the entire inference pipeline.
`HardwareConfig`	Hardware allocation and resource configuration.
`ModelConfig`	LLM model and generation configuration.
`BatchingConfig`	Request batching configuration.
`GuardrailsConfig`	Prompt guardrails configuration.
`DynamicWorkerConfig`	Dynamic inference worker spin-up and spin-down configuration.

LLM Proxy

Transport-agnostic interface for sending chat requests to the inference backend, with a Dragon queue-backed implementation and a reusable response-queue pool.

`LLMProxy`	Transport-agnostic proxy interface for LLM chat inference.
`DragonQueueLLMProxy`	LLM proxy backed by a Dragon IPC queue.
`InferenceRequest`	Typed request sent through the inference input queue.
`ResponseQueuePool`	Bounded pool of reusable, minimal Dragon Queues.
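
The idea of a bounded pool of reusable response queues can be illustrated with the standard library alone. The sketch below is not the `ResponseQueuePool` implementation: the class name `BoundedQueuePool`, its methods, and the use of `queue.Queue` in place of Dragon Queues are all assumptions made for illustration.

```python
import queue
from contextlib import contextmanager


class BoundedQueuePool:
    """Hypothetical stand-in for a bounded pool of reusable response queues."""

    def __init__(self, size: int):
        if size < 1:
            raise ValueError("size must be at least 1")
        # The pool is itself a queue that holds the currently idle response queues.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(queue.Queue())

    def acquire(self, timeout=None):
        """Block until a response queue is free, then hand it out."""
        return self._idle.get(timeout=timeout)

    def release(self, q):
        """Drain leftovers so the next borrower starts clean, then return the queue."""
        while True:
            try:
                q.get_nowait()
            except queue.Empty:
                break
        self._idle.put_nowait(q)

    @contextmanager
    def borrowed(self, timeout=None):
        """Borrow a queue for the duration of a `with` block."""
        q = self.acquire(timeout=timeout)
        try:
            yield q
        finally:
            self.release(q)
```

Bounding the pool caps the number of response queues that exist at once, and reusing them avoids creating a new queue for every request.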

Batching

Dynamic request batching: individual request items, assembled batches, and the batcher that collects prompts over a configurable time window.

`DynamicBatcher`	Dynamic batching component that collects prompts over a time window and forwards batched inputs for processing.
`Batch`	A collection of items to be processed together.
`BatchItem`	A single item to be batched.
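
Collecting prompts over a time window can be sketched as follows. This is a minimal illustration, not the `DynamicBatcher` implementation: the names `Item`, `CollectedBatch`, and `collect_batch`, the `max_size` and `window_s` parameters, and the use of `queue.Queue` instead of a Dragon Queue are assumptions.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical stand-in for a single batched request."""
    prompt: str


@dataclass
class CollectedBatch:
    """Hypothetical stand-in for an assembled batch."""
    items: list = field(default_factory=list)


def collect_batch(source, max_size, window_s):
    """Wait for a first item, then gather more until the window closes or the batch is full."""
    first = source.get()  # block until at least one request is available
    batch = CollectedBatch(items=[first])
    deadline = time.monotonic() + window_s  # the window starts at the first item
    while len(batch.items) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.items.append(source.get(timeout=remaining))
        except queue.Empty:
            break  # window expired with no further requests
    return batch
```

In this sketch the window starts when the first item arrives, so an isolated request waits at most `window_s` before it is forwarded, while a burst of requests is forwarded as soon as `max_size` items have been collected.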

Guardrails

Prompt safety checking using the PromptGuard model, separated from the main inference logic.

GuardrailsProcessor

Handles prompt safety checking using the PromptGuard model.

PromptGuard

Evaluate text with a PromptGuard jailbreak/injection classifier.

LLM Engine

vLLM-based inference engine and supporting utilities for chat-template formatting, port allocation, and streaming generation.

`LLMInferenceEngine`	Handles LLM inference using vLLM in a tensor-parallel environment.
`StreamChunk`	A single chunk yielded during streaming generation.
`StreamDoneSentinel`	Sentinel object indicating the end of a streaming response.
`STREAM_DONE_SENTINEL`	Sentinel object indicating the end of a streaming response.
`chat_template_formatter`(system_prompt, ...)	Format the prompt using the model's chat template via its tokenizer.
`find_free_port`([device_index, base_port, ...])	Return an available TCP port using the worker's device index as seed.
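
The stream chunk and end-of-stream sentinel entries above suggest a consumer loop that reads chunks until the sentinel arrives. The sketch below illustrates that pattern with stand-in types. `Chunk`, its `text` field, `DoneMarker`, and `read_stream` are hypothetical names, and the real `StreamChunk` fields may differ. The marker is detected with `isinstance` rather than identity on the assumption that objects passed between processes are serialized, so the receiver would see a copy rather than the same instance.

```python
import queue
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a streamed chunk; real field names may differ."""
    text: str


class DoneMarker:
    """Hypothetical end-of-stream marker type."""


DONE = DoneMarker()  # module-level instance that producers enqueue last


def read_stream(response_queue, timeout=5.0):
    """Yield chunk text until the end-of-stream marker arrives."""
    while True:
        item = response_queue.get(timeout=timeout)
        if isinstance(item, DoneMarker):
            return  # stream finished; stop the generator
        yield item.text
```

A caller could then assemble the full response with `"".join(read_stream(q))`, or print each piece as it arrives for interactive use.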

Async Streaming Mode

The inference engine supports two operational modes controlled by the use_async_streaming configuration flag:

Synchronous Mode (default, use_async_streaming: false): Uses vLLM’s synchronous LLM engine. Suitable for batch processing workloads where multiple requests are accumulated and processed together. Requests with stream: true will return an error in this mode.
Async Streaming Mode (use_async_streaming: true): Uses vLLM’s V1 AsyncLLM engine for token-by-token streaming. Suitable for interactive workloads requiring low time-to-first-token latency. Supports both streaming (stream: true) and non-streaming (stream: false) requests.

Note

Async streaming mode and batch mode are mutually exclusive. If use_async_streaming: true, then input_batching.toggle_on must be false. The configuration validator will raise an error if both are enabled.

Configuration example:

llm:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"
  use_async_streaming: true  # Enable AsyncLLM for streaming
  # ... other model config

input_batching:
  toggle_on: false  # Must be false when streaming is enabled
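
The mutual-exclusion rule between async streaming and input batching can be expressed as a dataclass check. The sketch below is not the module's configuration validator: the class names `LLMSettings`, `BatchingSettings`, and `PipelineSettings` are hypothetical, and only the field names `model_name`, `use_async_streaming`, and `toggle_on` are taken from the configuration example.

```python
from dataclasses import dataclass


@dataclass
class LLMSettings:
    """Hypothetical subset of the llm section."""
    model_name: str
    use_async_streaming: bool = False  # synchronous mode is the documented default


@dataclass
class BatchingSettings:
    """Hypothetical subset of the input_batching section."""
    toggle_on: bool


@dataclass
class PipelineSettings:
    """Hypothetical top-level config that enforces the mutual-exclusion rule."""
    llm: LLMSettings
    input_batching: BatchingSettings

    def __post_init__(self):
        # Reject the combination as soon as the config object is built.
        if self.llm.use_async_streaming and self.input_batching.toggle_on:
            raise ValueError(
                "use_async_streaming and input_batching.toggle_on cannot both be true"
            )
```

Validating in `__post_init__` surfaces the conflict when the configuration is constructed, before any worker is launched.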

Request routing based on configuration and request flags:

`use_async_streaming`	Request `stream`	Method Called	Result
`false`	`false`	`generate()`	Sync batch/single response
`false`	`true`	Error	Streaming not available
`true`	`false`	`generate_single()`	Complete response
`true`	`true`	`generate_stream()`	Chunked token stream
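
The routing table can be read as a small decision function. The sketch below only returns the name of the method that would be called. The function `select_method` is hypothetical, and raising `ValueError` for the unsupported combination is an assumption, since the source states only that such requests return an error.

```python
def select_method(use_async_streaming: bool, stream: bool) -> str:
    """Return the engine method name for a request, following the routing table."""
    if use_async_streaming:
        # Async mode serves both streaming and non-streaming requests.
        return "generate_stream" if stream else "generate_single"
    if stream:
        # Synchronous mode cannot stream; the error type here is an assumption.
        raise ValueError("streaming requested but use_async_streaming is false")
    return "generate"
```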

Workers

GPU inference workers and the CPU head worker that monitors concurrency and dynamically spins inference workers up and down.

InferenceWorker

The inference worker class orchestrates the pre-processing module and the main LLM inference module, which is tensor-parallelized and processes requests in batches.

CPUWorker

The CPU worker class monitors prompt concurrency and dynamically spins up and spins down the inference workers assigned to it.

Reader and Metrics

Response collection from the output queue and latency/throughput metrics consolidation.

`ReadWorker`	Read raw inference responses from a queue for simple drivers.
`MetricsConsolidator`	Read response metrics, aggregate them, and write them to Excel.