.. _InferenceAPI: Inference +++++++++ The Dragon Inference module provides distributed, multi-GPU and multi-node LLM inference capabilities for low-latency, high-throughput generative AI workloads on HPC clusters. It features a pull-based distributed load balancing component managed through RDMA-enabled shared Dragon Queues. The module also incorporates dynamic batching of inference requests, optional prompt guardrails, and a tensor-parallelized vLLM backend with Dragon's process and communication primitives. .. note:: This module is experimental and not yet in its final state. See the ``src/dragon/ai/inference/README.md`` for installation and configuration instructions. Python Reference ================ Core ---- Entry point for initializing and launching the full inference pipeline across nodes and GPUs. .. currentmodule:: dragon.ai.inference.inference_utils .. autosummary:: :toctree: :recursive: Inference Configuration ------------- Type-safe dataclasses covering hardware allocation, model parameters, batching, guardrails, dynamic worker management, and the top-level composite config. .. currentmodule:: dragon.ai.inference.config .. autosummary:: :toctree: :recursive: InferenceConfig HardwareConfig ModelConfig BatchingConfig GuardrailsConfig DynamicWorkerConfig LLM Proxy --------- Transport-agnostic interface for sending chat requests to the inference backend, with a Dragon queue-backed implementation and a reusable response-queue pool. .. currentmodule:: dragon.ai.inference.llm_proxy .. autosummary:: :toctree: :recursive: LLMProxy DragonQueueLLMProxy InferenceRequest ResponseQueuePool Batching -------- Dynamic request batching: individual request items, assembled batches, and the batcher that collects prompts over a configurable time window. .. currentmodule:: dragon.ai.inference.batching .. autosummary:: :toctree: :recursive: DynamicBatcher Batch BatchItem Guardrails ---------- Prompt safety checking using the PromptGuard model, separated from the main inference logic. .. currentmodule:: dragon.ai.inference.guardrails .. autosummary:: :toctree: :recursive: GuardrailsProcessor .. currentmodule:: dragon.ai.inference.prompt_guard_utils .. autosummary:: :toctree: :recursive: PromptGuard LLM Engine ---------- vLLM-based inference engine and supporting utilities for chat-template formatting and port allocation. .. currentmodule:: dragon.ai.inference.llm_engine .. autosummary:: :toctree: :recursive: LLMInferenceEngine chat_template_formatter find_free_port Workers ------- GPU inference workers and the CPU head worker that monitors concurrency and dynamically spins inference workers up and down. .. currentmodule:: dragon.ai.inference.inference_worker_utils .. autosummary:: :toctree: :recursive: InferenceWorker .. currentmodule:: dragon.ai.inference.cpu_worker_utils .. autosummary:: :toctree: :recursive: CPUWorker Reader and Metrics ------------------ Response collection from the output queue and latency/throughput metrics consolidation. .. currentmodule:: dragon.ai.inference.reader_utils .. autosummary:: :toctree: :recursive: ReadWorker MetricsConsolidator