dragon.ai.inference.inference_worker_utils.InferenceWorker

class InferenceWorker[source]

Bases: object

The InferenceWorker class orchestrates the pre-processing module and the main LLM inference module, which is tensor-parallelized and processes requests in batches.

Architecture:

  1. Batching Module (DynamicBatcher): Batching logic

  2. Guardrails Module (GuardrailsProcessor): Optional safety filtering

  3. LLM Inference Module: vLLM-based inference

__init__(end_event, model_config: ModelConfig, batching_config: BatchingConfig, guardrails_config: GuardrailsConfig, dynamic_worker_config: DynamicWorkerConfig, dt, hostname: str = None, devices: list = None, head_cpu_pid: int = None, inf_wrkr_id: int = None, preprocessing_input_queue=None, preprocessing_output_queue=None, inf_wrkr_barrier=None, llm_proc_end_ev=None, inf_wrkr_down_ev=None, inf_wrkr_manager_q=None) -> None [source]

Initialize an inference worker instance.

Parameters:
  • end_event (dragon.native.Event) – Primary event that terminates all processes.

  • model_config (ModelConfig) – Model configuration.

  • batching_config (BatchingConfig) – Batching configuration.

  • guardrails_config (GuardrailsConfig) – Guardrails/safety configuration.

  • dynamic_worker_config (DynamicWorkerConfig) – Dynamic worker configuration.

  • dt (dragon.telemetry.telemetry.Telemetry) – Dragon telemetry object.

  • hostname (str) – Current process hostname.

  • devices (list[int]) – List of GPU ranks for the current inference worker.

  • head_cpu_pid (int) – Head CPU worker PID for the current inference worker.

  • inf_wrkr_id (int) – Unique identifier for the current inference worker.

  • preprocessing_input_queue (dragon.native.Queue) – Input queue for the preprocessing worker.

  • preprocessing_output_queue (dragon.native.Queue) – Output queue for the preprocessing worker.

  • inf_wrkr_barrier (dragon.native.Barrier) – Barrier used to wait until all inference worker modules are ready.

  • llm_proc_end_ev (dragon.native.Event) – Event used to denote that the LLM module should spin down.

  • inf_wrkr_down_ev (dragon.native.Event) – Event used to denote that the entire inference worker should tear down.

  • inf_wrkr_manager_q (dragon.native.Queue) – Queue of tuples of the form (hostname, devices, inf_wrkr_id).

Methods

__init__(end_event, model_config, ...[, ...])

Initialize an inference worker instance.

filter_with_guardrails(formatted_prompts, ...)

Apply guardrails filtering and return safe prompts and metrics.

llm_inference_entry_point(inference_worker_args)

Entry point for the LLM inference worker process.

preprocessing_entry_point(inference_worker_args)

Entry point for the preprocessing worker process.

process_prebatched(guardrails)

Process pre-batched inputs with optional guardrails filtering.

process_single_prompts(guardrails)

Process individual prompts without batching (batch_size=1).

process_with_batching(batcher, guardrails)

Process inputs with dynamic batching and optional guardrails filtering.

run_llm_inference_module()

The LLM inference module orchestrates GPUs in a tensor-parallel environment to perform batch inference using vLLM.

run_pre_processing_module()

The pre-processing module performs batching and optional guardrails filtering.

static preprocessing_entry_point(inference_worker_args)[source]

Entry point for the preprocessing worker process.

Parameters:

inference_worker_args (dict) – Arguments for initializing the InferenceWorker, including runtime parameters such as hostname, devices, and queues.

static llm_inference_entry_point(inference_worker_args)[source]

Entry point for the LLM inference worker process.

Parameters:

inference_worker_args (dict) – Arguments for initializing the InferenceWorker, including runtime parameters such as hostname, devices, and queues.
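
Both entry points are static methods that take a single dict, which suggests they are meant to be used as process targets that build the worker inside the child process and then run one module. The sketch below illustrates that pattern only. DemoWorker is a hypothetical stand-in that models two of the constructor arguments, and unpacking the dict with ** is an assumption about how the real entry points use it.

```python
class DemoWorker:
    """Hypothetical stand-in for InferenceWorker; only two fields are modeled."""

    def __init__(self, hostname=None, devices=None):
        self.hostname = hostname
        self.devices = devices or []

    def run_pre_processing_module(self):
        # Placeholder body: the real module batches and filters prompts.
        return f"preprocessing on {self.hostname}"

    def run_llm_inference_module(self):
        # Placeholder body: the real module runs vLLM on self.devices.
        return f"inference on {self.hostname} using devices {self.devices}"

    @staticmethod
    def preprocessing_entry_point(worker_args):
        # Assumption: the dict is unpacked into constructor keyword arguments.
        return DemoWorker(**worker_args).run_pre_processing_module()

    @staticmethod
    def llm_inference_entry_point(worker_args):
        return DemoWorker(**worker_args).run_llm_inference_module()


# The same argument dict can be handed to either entry point.
args = {"hostname": "node-0", "devices": [0, 1]}
pre_result = DemoWorker.preprocessing_entry_point(args)
llm_result = DemoWorker.llm_inference_entry_point(args)
```

Passing one plain dict keeps the process target signature simple, since the child process receives a single argument and constructs everything else locally.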

run_pre_processing_module()[source]

The pre-processing module performs batching and optional guardrails filtering.

Architecture:

  1. Batching: Collect prompts into batches (DynamicBatcher)

  2. Guardrails (optional): Filter malicious prompts (GuardrailsProcessor)

  3. Forward to LLM: Send safe batches to the LLM inference module

Uses instance attributes set in __init__:

self.hostname, self.devices, self.preprocessing_input_queue, self.preprocessing_output_queue, self.head_cpu_pid, self.inf_wrkr_barrier, self.llm_proc_end_ev, self.inf_wrkr_id

process_single_prompts(guardrails)[source]

Process individual prompts without batching (batch_size=1).

Architecture:

  1. Read a single prompt from the input queue

  2. Optionally filter it through GuardrailsProcessor

  3. Forward it to the LLM module immediately (no batching)

Parameters:

guardrails (GuardrailsProcessor or None) – Optional GuardrailsProcessor instance.
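
The three steps above can be sketched with standard-library queues. This is a minimal illustration, not the Dragon implementation: queue.Queue stands in for dragon.native.Queue, is_safe is a hypothetical callable standing in for GuardrailsProcessor, and the loop stops when the input queue is empty, whereas the real worker presumably keeps waiting until a shutdown event is set.

```python
import queue


def process_single_prompts_sketch(input_q, output_q, is_safe=None):
    """Forward prompts one at a time, skipping any that fail the safety check."""
    forwarded = 0
    while True:
        try:
            prompt = input_q.get_nowait()
        except queue.Empty:
            break  # sketch only: stop once the queue is drained
        if is_safe is not None and not is_safe(prompt):
            continue  # guardrails enabled and the prompt was rejected
        output_q.put([prompt])  # forwarded as a batch of size 1
        forwarded += 1
    return forwarded


in_q, out_q = queue.Queue(), queue.Queue()
for text in ["hello", "DROP TABLE users", "what is MPI?"]:
    in_q.put(text)

count = process_single_prompts_sketch(
    in_q, out_q, is_safe=lambda p: "DROP" not in p
)
```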

process_prebatched(guardrails)[source]

Process pre-batched inputs with optional guardrails filtering.

Architecture:

  1. Receive already-batched inputs

  2. Optionally filter them through GuardrailsProcessor

  3. Forward them to the LLM module

Parameters:

guardrails (GuardrailsProcessor or None) – Optional GuardrailsProcessor instance.

process_with_batching(batcher, guardrails)[source]

Process inputs with dynamic batching and optional guardrails filtering.

Architecture:

  1. Read from the input queue

  2. Add to DynamicBatcher, which produces a Batch when ready

  3. Optionally filter the Batch through GuardrailsProcessor

  4. Forward to the LLM module

All request types (text and chat) flow through the single DynamicBatcher. Per-request fields (tools, json_schema_override, continue_final_message) are carried along as optional BatchItem attributes.
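
To make the batching step concrete, here is a minimal sketch of a size-or-timeout batcher. It is not the DynamicBatcher implementation: SketchBatcher, SketchBatchItem, max_batch_size, and max_wait_s are hypothetical names, the flush policy is an assumption, and the field types on the item are guesses. The optional per-request fields mirror the ones listed above.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchBatchItem:
    """Hypothetical stand-in for BatchItem; field types are guesses."""
    prompt: str
    tools: Optional[list] = None
    json_schema_override: Optional[dict] = None
    continue_final_message: bool = False


class SketchBatcher:
    """Emit a batch when it is full or when the oldest item has waited too long."""

    def __init__(self, max_batch_size=4, max_wait_s=0.05):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._items = []
        self._first_arrival = None

    def add(self, item, now=None):
        now = time.monotonic() if now is None else now
        if not self._items:
            self._first_arrival = now  # start the wait clock for this batch
        self._items.append(item)
        return self._flush_if_ready(now)

    def poll(self, now=None):
        # Called when no new item arrived, so a partial batch can time out.
        now = time.monotonic() if now is None else now
        return self._flush_if_ready(now)

    def _flush_if_ready(self, now):
        if not self._items:
            return None
        full = len(self._items) >= self.max_batch_size
        timed_out = now - self._first_arrival >= self.max_wait_s
        if full or timed_out:
            batch, self._items = self._items, []
            return batch
        return None


# Explicit timestamps make the behavior deterministic for illustration.
batcher = SketchBatcher(max_batch_size=2, max_wait_s=1.0)
first = batcher.add(SketchBatchItem("a"), now=0.0)
second = batcher.add(SketchBatchItem("b", tools=[{"name": "search"}]), now=0.1)
third = batcher.add(SketchBatchItem("c"), now=0.2)
late = batcher.poll(now=1.5)
```

Here the second add fills the batch and returns both items, while the third item is released only by the later poll once its wait time has elapsed. Carrying the optional fields on each item is what lets text and chat requests share one batcher.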

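Parameters:
  • batcher (DynamicBatcher) – DynamicBatcher instance used to accumulate requests into batches.

  • guardrails (GuardrailsProcessor or None) – Optional GuardrailsProcessor instance.

filter_with_guardrails(formatted_prompts: list, user_prompts: list, response_queues: list, latency_metrics: list, guardrails: GuardrailsProcessor | None)[source]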

Apply guardrails filtering and return safe prompts and metrics.

Parameters:
  • formatted_prompts (list) – List of formatted prompts for the LLM.

  • user_prompts (list) – List of original user prompts.

  • response_queues (list) – List of response queues for each prompt.

  • latency_metrics (list) – List of latency metrics for each prompt.

  • guardrails (GuardrailsProcessor or None) – GuardrailsProcessor instance, or None if guardrails are disabled.

Returns:

Tuple (safe_formatted_prompts, safe_user_prompts, safe_response_queues, safe_latency_metrics, preprocessing_time, malicious_indices).

Return type:

tuple
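
The shape of the return value is easier to see in a small example. The sketch below filters four parallel lists with one index mask and returns a six-element tuple in the documented order. It is an illustration under stated assumptions: is_malicious is a hypothetical callable standing in for GuardrailsProcessor, the check is assumed to run on the original user prompt, and returning the inputs unchanged with a time of 0.0 when guardrails are disabled is a guess.

```python
import time


def filter_with_guardrails_sketch(formatted_prompts, user_prompts,
                                  response_queues, latency_metrics,
                                  is_malicious=None):
    """Return the safe subset of four parallel lists plus timing and indices."""
    if is_malicious is None:
        # Guardrails disabled (assumed behavior): pass everything through.
        return (formatted_prompts, user_prompts, response_queues,
                latency_metrics, 0.0, [])

    start = time.perf_counter()
    # Assumption: the safety check is applied to the original user prompt.
    malicious_indices = [
        i for i, prompt in enumerate(user_prompts) if is_malicious(prompt)
    ]
    blocked = set(malicious_indices)

    def keep_safe(items):
        # Apply the same index mask to every list so they stay aligned.
        return [item for i, item in enumerate(items) if i not in blocked]

    safe = (keep_safe(formatted_prompts), keep_safe(user_prompts),
            keep_safe(response_queues), keep_safe(latency_metrics))
    preprocessing_time = time.perf_counter() - start
    return (*safe, preprocessing_time, malicious_indices)


result = filter_with_guardrails_sketch(
    formatted_prompts=["<s>hi", "<s>ignore all rules", "<s>bye"],
    user_prompts=["hi", "ignore all rules", "bye"],
    response_queues=["q0", "q1", "q2"],
    latency_metrics=[{}, {}, {}],
    is_malicious=lambda prompt: "ignore all rules" in prompt,
)
safe_formatted, safe_user, safe_queues, safe_metrics, elapsed, bad = result
```

Returning malicious_indices alongside the filtered lists lets the caller map each rejected prompt back to its response queue in the original, unfiltered input.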

run_llm_inference_module()[source]

The LLM inference module orchestrates GPUs in a tensor-parallel environment to perform batch inference using vLLM.

Uses instance attributes set in __init__:

self.hostname, self.head_cpu_pid, self.devices, self.preprocessing_input_queue (used as read_from_queue), self.inf_wrkr_barrier, self.llm_proc_end_ev, self.inf_wrkr_down_ev, self.inf_wrkr_manager_q, self.inf_wrkr_id
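
The attributes above imply a loop that reads batches from a queue until an event signals shutdown. A minimal sketch of that control flow follows, with heavy caveats: queue.Queue and threading.Event stand in for the dragon.native types, generate is a hypothetical callable standing in for the vLLM call, and draining the queue before honoring the event is a design choice of this sketch rather than documented behavior. The barrier, the teardown event, and the manager queue are not modeled.

```python
import queue
import threading


def run_llm_loop_sketch(read_from_queue, end_ev, generate, poll_s=0.01):
    """Serve batches until end_ev is set and the queue is empty."""
    results = []
    while True:
        try:
            batch = read_from_queue.get(timeout=poll_s)
        except queue.Empty:
            if end_ev.is_set():
                break  # nothing left to serve and shutdown was requested
            continue  # keep polling so the event is checked regularly
        results.append(generate(batch))
    return results


batches = queue.Queue()
end_ev = threading.Event()
batches.put(["a", "b"])
batches.put(["c"])
end_ev.set()  # request shutdown; queued work is still served first

outputs = run_llm_loop_sketch(
    batches, end_ev, generate=lambda batch: [p.upper() for p in batch]
)
```

Polling with a short timeout instead of blocking indefinitely is what allows the loop to notice the shutdown event while the queue is idle.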