Inference
The Dragon Inference module provides distributed, multi-GPU, multi-node LLM inference for low-latency, high-throughput generative AI workloads on HPC clusters. It features a pull-based distributed load-balancing component managed through RDMA-enabled shared Dragon Queues. The module also incorporates dynamic batching of inference requests, optional prompt guardrails, and a tensor-parallelized vLLM backend integrated with Dragon’s process and communication primitives.
Note
This module is experimental and not yet in its final state. See the
src/dragon/ai/inference/README.md for installation and configuration
instructions.
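The pull-based load balancing described above can be illustrated with a minimal sketch. It uses the standard library's `queue.Queue` and threads as stand-ins for Dragon Queues and Dragon processes; the names `worker` and `run_pull_based` are hypothetical and not part of the module's API. Each worker pulls the next request only when it is free, so no central dispatcher has to track worker load.

```python
import queue
import threading


def worker(worker_id, requests, results):
    """Pull requests from the shared queue until a None sentinel arrives."""
    while True:
        prompt = requests.get()
        if prompt is None:
            break
        # Stand-in for real inference: upper-case the prompt.
        results.put((worker_id, prompt, prompt.upper()))


def run_pull_based(prompts, num_workers=3):
    """Feed prompts into one shared queue that idle workers pull from."""
    requests = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, requests, results))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for prompt in prompts:
        requests.put(prompt)
    for _ in threads:
        requests.put(None)  # one shutdown sentinel per worker
    for t in threads:
        t.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

Which worker handles which prompt is not deterministic; faster workers simply pull more often, which is the point of the pull-based design.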
Python Reference
Core
Entry point for initializing and launching the full inference pipeline across nodes and GPUs.
This class is the starting point for initializing the inference pipeline.
Configuration
Type-safe dataclasses covering hardware allocation, model parameters, batching, guardrails, dynamic worker management, and the top-level composite config.
Master configuration for the entire inference pipeline.
Hardware allocation and resource configuration.
LLM model configuration.
Dynamic batching configuration.
Prompt guardrails/safety configuration.
Dynamic inference worker spin-up/down configuration.
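As an illustration of the composite-config pattern, the following is a minimal sketch using standard dataclasses. All class and field names (`HardwareSketch`, `BatchingSketch`, `PipelineSketch`, `gpus_per_node`, and so on) are hypothetical and do not reflect the module's actual configuration classes; see the README for the real fields.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HardwareSketch:
    """Hypothetical hardware section of the config."""
    num_nodes: int = 1
    gpus_per_node: int = 4
    tensor_parallel_size: int = 2

    def __post_init__(self):
        # Validate early so a bad config fails before any worker starts.
        if self.gpus_per_node % self.tensor_parallel_size != 0:
            raise ValueError("gpus_per_node must be divisible by tensor_parallel_size")

    @property
    def workers_per_node(self) -> int:
        return self.gpus_per_node // self.tensor_parallel_size


@dataclass(frozen=True)
class BatchingSketch:
    """Hypothetical batching section of the config."""
    enabled: bool = True
    max_batch_size: int = 8
    window_s: float = 0.05


@dataclass(frozen=True)
class PipelineSketch:
    """Hypothetical top-level config composed of the sections above."""
    hardware: HardwareSketch = field(default_factory=HardwareSketch)
    batching: BatchingSketch = field(default_factory=BatchingSketch)
```

Grouping related settings into small dataclasses keeps validation local to each section while the top-level object stays a plain composition.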
LLM Proxy
Transport-agnostic interface for sending chat requests to the inference backend, with a Dragon queue-backed implementation and a reusable response-queue pool.
Transport-agnostic proxy interface for LLM chat inference.
LLM proxy backed by a Dragon IPC queue.
Typed request sent through the inference input queue.
Bounded pool of reusable, minimal Dragon Queues.
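The relationship between these pieces can be sketched as follows. This is a minimal illustration with `queue.Queue` standing in for Dragon Queues; `ChatProxySketch`, `QueueProxySketch`, `ChatRequestSketch`, `ResponseQueuePoolSketch`, and `echo_backend` are hypothetical names, not the module's API.

```python
import queue
import threading
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ChatRequestSketch:
    """Hypothetical typed request: the prompt plus where to send the reply."""
    prompt: str
    response_queue: queue.Queue


class ResponseQueuePoolSketch:
    """Bounded pool of reusable single-slot response queues."""

    def __init__(self, size: int):
        self._free = queue.Queue(maxsize=size)
        for _ in range(size):
            self._free.put(queue.Queue(maxsize=1))

    def acquire(self) -> queue.Queue:
        return self._free.get()  # blocks when every queue is in use

    def release(self, q: queue.Queue) -> None:
        self._free.put(q)


class ChatProxySketch(ABC):
    """Transport-agnostic interface: callers only see chat()."""

    @abstractmethod
    def chat(self, prompt: str) -> str: ...


class QueueProxySketch(ChatProxySketch):
    """Queue-backed implementation of the interface."""

    def __init__(self, input_queue: queue.Queue, pool: ResponseQueuePoolSketch):
        self._input = input_queue
        self._pool = pool

    def chat(self, prompt: str, timeout: float = 5.0) -> str:
        response_queue = self._pool.acquire()
        try:
            self._input.put(ChatRequestSketch(prompt, response_queue))
            return response_queue.get(timeout=timeout)
        finally:
            # A real pool would also need to handle replies that arrive
            # after a timeout; this sketch ignores that case.
            self._pool.release(response_queue)


def echo_backend(input_queue: queue.Queue) -> None:
    """Stand-in backend: answers a single request, then exits."""
    request = input_queue.get()
    request.response_queue.put("echo: " + request.prompt)
```

Reusing response queues from a bounded pool avoids creating a new queue per request and caps the number of requests in flight from a single client.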
Batching
Dynamic request batching: individual request items, assembled batches, and the batcher that collects prompts over a configurable time window.
Dynamic batching component that collects prompts over a time window and forwards batched inputs for processing.
A collection of items to be processed together.
A single item to be batched.
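Time-window batching of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not the module's batcher: `BatchItemSketch`, `BatchSketch`, `collect_batch`, `max_size`, and `window_s` are hypothetical names, and `queue.Queue` stands in for the inference input queue.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class BatchItemSketch:
    """Hypothetical single item to be batched."""
    prompt: str


@dataclass
class BatchSketch:
    """Hypothetical collection of items processed together."""
    items: list = field(default_factory=list)


def collect_batch(source: queue.Queue, max_size: int = 4, window_s: float = 0.05) -> BatchSketch:
    """Block for the first item, then gather more until the window closes
    or the batch is full, whichever comes first."""
    batch = BatchSketch([source.get()])
    deadline = time.monotonic() + window_s
    while len(batch.items) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.items.append(source.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Starting the window only after the first item arrives means an idle batcher adds no latency; the window length trades per-request latency against batch size.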
Guardrails
Prompt safety checking using the PromptGuard model, separated from the main inference logic.
Handles prompt safety checking using the PromptGuard model.
Utilities for loading the PromptGuard model and evaluating text for jailbreaks and indirect injections.
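The separation between safety checking and inference can be illustrated with a small sketch in which the scorer is pluggable. Everything here is hypothetical: `toy_scorer` is a keyword-based stand-in for the PromptGuard model, and `check_prompt`, the score names, and the `threshold` parameter are assumptions rather than the module's API.

```python
from typing import Callable, Dict, List, Tuple


def toy_scorer(text: str) -> Dict[str, float]:
    """Stand-in for a safety model: returns a score per risk category."""
    lowered = text.lower()
    jailbreak = 0.9 if "ignore previous instructions" in lowered else 0.05
    return {"jailbreak": jailbreak, "indirect_injection": 0.05}


def check_prompt(
    text: str,
    scorer: Callable[[str], Dict[str, float]] = toy_scorer,
    threshold: float = 0.5,
) -> Tuple[bool, List[str]]:
    """Return (is_safe, flagged_categories) for a prompt.

    The inference path only needs the boolean; the flagged list is useful
    for logging or for building a refusal message.
    """
    scores = scorer(text)
    flagged = [name for name, score in scores.items() if score >= threshold]
    return (not flagged, flagged)
```

Because the check takes the scorer as an argument, the guardrail can be tested or disabled without touching the inference engine.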
LLM Engine
vLLM-based inference engine and supporting utilities for chat-template formatting and port allocation.
Handles LLM inference using vLLM in a tensor-parallel environment.
Format the prompt using the model's chat template via its tokenizer.
Return an available TCP port using the worker's device index as seed.
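One way to derive a free port from a device index is sketched below. This is an illustration under stated assumptions, not the module's implementation: `find_port`, `base_port`, and `max_tries` are hypothetical, and the probing strategy (start at base plus device index, then scan upward) is only one plausible reading of "seed".

```python
import socket


def find_port(device_index: int, base_port: int = 29500,
              max_tries: int = 100, host: str = "127.0.0.1") -> int:
    """Probe upward from base_port + device_index until a bind succeeds."""
    for offset in range(max_tries):
        port = base_port + device_index + offset
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port in use, try the next one
            return port
    raise RuntimeError("no free port found in the probed range")
```

Seeding by device index makes workers on the same node start probing at different ports, which reduces collisions. The probe socket is closed before the port is returned, so another process could still claim the port before the caller binds it.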
Workers
GPU inference workers and the CPU head worker that monitors concurrency and dynamically spins inference workers up and down.
The Inference Worker class orchestrates the pre-processing module and the main LLM inference module, which is tensor-parallelized and processes requests in batches.
The CPU worker class monitors prompt concurrency and dynamically spins the inference workers assigned to it up and down.
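A concurrency-driven scaling decision of the kind described for the CPU head worker might look like the sketch below. The policy, the function names `target_workers` and `scaling_action`, and the parameters `per_worker_capacity`, `min_workers`, and `max_workers` are all assumptions for illustration; the module's actual policy may differ.

```python
import math


def target_workers(in_flight: int, per_worker_capacity: int,
                   min_workers: int = 1, max_workers: int = 4) -> int:
    """Number of workers needed to serve the current in-flight prompts,
    clamped to the allowed range."""
    needed = math.ceil(in_flight / per_worker_capacity)
    return max(min_workers, min(max_workers, needed))


def scaling_action(current: int, in_flight: int, per_worker_capacity: int,
                   min_workers: int = 1, max_workers: int = 4) -> int:
    """Positive result: workers to spin up. Negative: workers to spin down."""
    target = target_workers(in_flight, per_worker_capacity, min_workers, max_workers)
    return target - current
```

For example, with a capacity of 8 prompts per worker and 17 prompts in flight, a head worker currently running 2 inference workers would spin up 1 more. A production policy would typically add hysteresis so workers are not started and stopped on every small fluctuation.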
Reader and Metrics
Response collection from the output queue and latency/throughput metrics consolidation.
Handles reading from the response queue and returning the response to the front end.
Handles reading from the response queue and returning the response to the front end.
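Latency and throughput consolidation can be sketched as a small aggregation over per-request timings. The record type `RequestTiming`, its fields, and the `consolidate` function are hypothetical; the metrics the module actually reports are not specified here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RequestTiming:
    """Hypothetical per-request record collected by a response reader."""
    start_s: float
    end_s: float
    output_tokens: int


def consolidate(timings: List[RequestTiming]) -> Dict[str, float]:
    """Aggregate per-request timings into latency and throughput figures."""
    latencies = [t.end_s - t.start_s for t in timings]
    # Wall-clock span from the first request start to the last response.
    wall_s = max(t.end_s for t in timings) - min(t.start_s for t in timings)
    total_tokens = sum(t.output_tokens for t in timings)
    return {
        "requests": len(timings),
        "mean_latency_s": mean(latencies),
        "max_latency_s": max(latencies),
        "tokens_per_s": total_tokens / wall_s if wall_s > 0 else 0.0,
    }
```

Throughput is computed over the wall-clock span rather than the sum of latencies, because requests overlap when they are batched or served by several workers.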