Inference

The Dragon Inference module provides distributed multi-GPU, multi-node LLM inference for low-latency, high-throughput generative AI workloads on HPC clusters. Load is balanced with a pull-based scheme coordinated through shared, RDMA-enabled Dragon Queues. The module also provides dynamic batching of inference requests, optional prompt guardrails, and a tensor-parallel vLLM backend integrated with Dragon’s process and communication primitives.
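As a rough illustration of pull-based load balancing, the sketch below has workers take prompts from one shared queue whenever they are idle, so no central dispatcher assigns work. It uses Python's standard `queue` and `threading` modules as stand-ins for Dragon Queues and Dragon processes. The function name `run_pull_workers` and the upper-casing "inference" step are hypothetical and are not part of the module.

```python
import queue
import threading


def run_pull_workers(prompts, num_workers=3):
    """Process prompts with workers that pull from one shared queue.

    Hypothetical stand-in: queue.Queue replaces a Dragon Queue and
    threads replace GPU inference workers.
    """
    work = queue.Queue()
    results = queue.Queue()
    for prompt in prompts:
        work.put(prompt)

    def worker(worker_id):
        # Each worker pulls the next prompt as soon as it is free, so
        # faster workers naturally take on more of the load.
        while True:
            try:
                prompt = work.get_nowait()
            except queue.Empty:
                return  # no work left
            results.put((worker_id, prompt.upper()))  # placeholder for inference

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

Which worker handles which prompt is not deterministic here; only the set of results is.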

Note

This module is experimental and not yet final. See src/dragon/ai/inference/README.md for installation and configuration instructions.

Python Reference

Core

Entry point for initializing and launching the full inference pipeline across nodes and GPUs.

Inference

This class is the starting point for initializing the inference pipeline.

Configuration

Type-safe dataclasses covering hardware allocation, model parameters, batching, guardrails, dynamic worker management, and the top-level composite config.

InferenceConfig

Master configuration for the entire inference pipeline.

HardwareConfig

Hardware allocation and resource configuration.

ModelConfig

LLM model configuration.

BatchingConfig

Dynamic batching configuration.

GuardrailsConfig

Prompt guardrails/safety configuration.

DynamicWorkerConfig

Dynamic inference worker spin-up/down configuration.

LLM Proxy

Transport-agnostic interface for sending chat requests to the inference backend, with a Dragon queue-backed implementation and a reusable response-queue pool.

LLMProxy

Transport-agnostic proxy interface for LLM chat inference.

DragonQueueLLMProxy

LLM proxy backed by a Dragon IPC queue.

InferenceRequest

Typed request sent through the inference input queue.

ResponseQueuePool

Bounded pool of reusable, minimal Dragon Queues.
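The general pattern behind a queue-backed proxy with a reusable response-queue pool can be sketched with the standard library alone. Everything in this example is an assumption made for illustration: `ChatProxy`, `QueuePool`, `QueueChatProxy`, and `echo_backend` are hypothetical names, `queue.Queue` stands in for a Dragon Queue, and the real `LLMProxy`, `DragonQueueLLMProxy`, and `ResponseQueuePool` signatures may differ.

```python
import queue
import threading
from abc import ABC, abstractmethod
from contextlib import contextmanager


class ChatProxy(ABC):
    """Hypothetical transport-agnostic interface (not the real LLMProxy API)."""

    @abstractmethod
    def chat(self, prompt: str, timeout: float = 5.0) -> str:
        """Send one prompt and return the reply."""


class QueuePool:
    """Bounded pool of reusable single-slot response queues."""

    def __init__(self, size: int):
        self._free = queue.Queue(maxsize=size)
        for _ in range(size):
            self._free.put(queue.Queue(maxsize=1))

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks, or raises queue.Empty after `timeout`, when all queues are in use.
        response_queue = self._free.get(timeout=timeout)
        try:
            yield response_queue
        finally:
            # Drain any stale reply before the queue is handed out again.
            while not response_queue.empty():
                response_queue.get_nowait()
            self._free.put(response_queue)


class QueueChatProxy(ChatProxy):
    """Queue-backed implementation: each request carries its own reply queue."""

    def __init__(self, input_queue, pool):
        self._input_queue = input_queue
        self._pool = pool

    def chat(self, prompt, timeout=5.0):
        with self._pool.borrow(timeout=timeout) as response_queue:
            self._input_queue.put((prompt, response_queue))
            return response_queue.get(timeout=timeout)


def echo_backend(input_queue, num_requests):
    """Stand-in for an inference worker: echoes each prompt back."""
    for _ in range(num_requests):
        prompt, response_queue = input_queue.get()
        response_queue.put(f"echo: {prompt}")
```

Callers depend only on the abstract `chat` method, so a different transport could be swapped in without changing client code. Bounding the pool caps the number of live response queues and applies back-pressure once every queue is borrowed.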

Batching

Dynamic request batching: individual request items, assembled batches, and the batcher that collects prompts over a configurable time window.

DynamicBatcher

Dynamic batching component that collects prompts over a time window and forwards batched inputs for processing.

Batch

A collection of items to be processed together.

BatchItem

A single item to be batched.
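Collecting prompts over a time window can be sketched as follows. This is a minimal illustration that assumes a standard `queue.Queue` as the prompt source. The function `collect_batch` and its parameters `max_batch_size` and `window_s` are hypothetical and do not reflect the actual `DynamicBatcher` or `BatchingConfig` fields.

```python
import queue
import time


def collect_batch(source, max_batch_size=8, window_s=0.05):
    """Gather items until the batch is full or the time window closes.

    Blocks for the first item, then keeps collecting for at most
    `window_s` seconds. Hypothetical helper, not the module's API.
    """
    batch = [source.get()]  # wait for at least one item
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window closed
        try:
            batch.append(source.get(timeout=remaining))
        except queue.Empty:
            break  # nothing more arrived within the window
    return batch
```

A longer window tends to produce larger batches and higher throughput at the cost of added latency for the first request in each batch.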

Guardrails

Prompt safety checking using the PromptGuard model, separated from the main inference logic.

GuardrailsProcessor

Handles prompt safety checking using the PromptGuard model.

PromptGuard

Utilities for loading the PromptGuard model and evaluating text for jailbreaks and indirect injections.

LLM Engine

vLLM-based inference engine and supporting utilities for chat-template formatting and port allocation.

LLMInferenceEngine

Handles LLM inference using vLLM in a tensor-parallel environment.

chat_template_formatter(system_prompt, ...)

Format the prompt using the model's chat template via its tokenizer.

find_free_port([device_index, base_port, ...])

Return an available TCP port using the worker's device index as seed.
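One way to derive a port from a device index is to probe upward from an index-dependent starting point, as in the sketch below. This is an assumption about the approach, not the actual `find_free_port` implementation: the name `pick_free_port`, the `stride` and `max_tries` parameters, and the default base port are all hypothetical.

```python
import socket


def pick_free_port(device_index=0, base_port=29500, stride=100, max_tries=100):
    """Return a TCP port that was free at the time of the check.

    Hypothetical sketch: the device index offsets the starting port so
    workers on different devices probe disjoint ranges.
    """
    start = base_port + device_index * stride
    stop = min(start + max_tries, 65536)  # stay inside the valid port range
    for port in range(start, stop):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port in use, try the next one
            return port  # the probe socket closes when the with-block exits
    raise RuntimeError(f"no free port found in range {start}-{stop - 1}")
```

The port is only known to be free at probe time. Another process may claim it before the caller binds, so callers should be ready to retry.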

Workers

GPU inference workers and the CPU head worker that monitors concurrency and dynamically spins inference workers up and down.

InferenceWorker

Orchestrates the pre-processing module and the main LLM inference module, which runs tensor-parallel and processes requests in batches.

CPUWorker

Monitors prompt concurrency to dynamically spin up and spin down the inference workers assigned to it.
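The scaling policy itself is not described here. As one possible illustration, the sketch below derives a target worker count from the number of in-flight prompts and clamps it between a minimum and a maximum. The functions `target_worker_count` and `scaling_action` and all their parameters are hypothetical, and the module's real policy and `DynamicWorkerConfig` fields may differ.

```python
import math


def target_worker_count(in_flight_prompts, prompts_per_worker=8,
                        min_workers=1, max_workers=4):
    """Workers needed for the current load, clamped to [min_workers, max_workers].

    Hypothetical policy: one worker per `prompts_per_worker` concurrent prompts.
    """
    needed = math.ceil(in_flight_prompts / prompts_per_worker)
    return max(min_workers, min(max_workers, needed))


def scaling_action(current_workers, in_flight_prompts, **policy):
    """Return how many workers to add (positive) or remove (negative)."""
    return target_worker_count(in_flight_prompts, **policy) - current_workers
```

A production policy would likely add hysteresis or a cool-down period so that short bursts do not cause workers to be started and stopped repeatedly.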

Reader and Metrics

Response collection from the output queue and latency/throughput metrics consolidation.

ReadWorker

Handles reading from the response queue and returning the response to the front end.

MetricsConsolidator

Consolidates latency and throughput metrics.