dragon.ai.inference.inference_worker_utils.InferenceWorker
- class InferenceWorker[source]
Bases: object
The InferenceWorker class orchestrates the pre-processing module and the main LLM inference module, which is tensor-parallelized and processes requests in batches.
Architecture: 1. Batching Module (DynamicBatcher): Batching logic 2. Guardrails Module (GuardrailsProcessor): Optional safety filtering 3. LLM Inference Module: vLLM-based inference
- __init__(end_event, model_config: ModelConfig, batching_config: BatchingConfig, guardrails_config: GuardrailsConfig, dynamic_worker_config: DynamicWorkerConfig, dt, hostname: str = None, devices: list = None, head_cpu_pid: int = None, inf_wrkr_id: int = None, preprocessing_input_queue=None, preprocessing_output_queue=None, inf_wrkr_barrier=None, llm_proc_end_ev=None, inf_wrkr_down_ev=None, inf_wrkr_manager_q=None) → None[source]
Initialize an inference worker instance.
- Parameters:
end_event (dragon.native.Event) – Primary event that terminates all processes.
model_config (ModelConfig) – Model configuration.
batching_config (BatchingConfig) – Batching configuration.
guardrails_config (GuardrailsConfig) – Guardrails/safety configuration.
dynamic_worker_config (DynamicWorkerConfig) – Dynamic worker configuration.
dt (dragon.telemetry.telemetry.Telemetry) – Dragon telemetry object.
hostname (str) – Current process hostname.
devices (list[int]) – List of GPU ranks for the current inference worker.
head_cpu_pid (int) – Head CPU worker PID for the current inference worker.
inf_wrkr_id (int) – Unique identifier for the current inference worker.
preprocessing_input_queue (dragon.native.Queue) – Input queue for the preprocessing worker.
preprocessing_output_queue (dragon.native.Queue) – Output queue for the preprocessing worker.
inf_wrkr_barrier (dragon.native.Barrier) – Barrier used to wait until all inference worker modules are ready.
llm_proc_end_ev (dragon.native.Event) – Event used to denote that the LLM module should spin down.
inf_wrkr_down_ev (dragon.native.Event) – Event used to denote that the entire inference worker should tear down.
inf_wrkr_manager_q (dragon.native.Queue) – Queue of tuples of the form (hostname, devices, inf_wrkr_id).
Methods
__init__(end_event, model_config, ...[, ...]) – Initialize an inference worker instance.
filter_with_guardrails(formatted_prompts, ...) – Apply guardrails filtering and return safe prompts and metrics.
llm_inference_entry_point(inference_worker_args) – Entry point for the LLM inference worker process.
preprocessing_entry_point(inference_worker_args) – Entry point for the preprocessing worker process.
process_prebatched(guardrails) – Process pre-batched inputs with optional guardrails filtering.
process_single_prompts(guardrails) – Process individual prompts without batching (batch_size=1).
process_with_batching(batcher, guardrails) – Process inputs with dynamic batching and optional guardrails filtering.
run_llm_inference_module() – The LLM inference module orchestrates GPUs in a tensor-parallel environment to perform batch inference using vLLM.
run_pre_processing_module() – The pre-processing module performs batching and optional guardrails filtering.
- static preprocessing_entry_point(inference_worker_args)[source]
Entry point for the preprocessing worker process.
- Parameters:
inference_worker_args (dict) – Arguments for initializing the InferenceWorker, including runtime parameters such as hostname, devices, and queues.
- static llm_inference_entry_point(inference_worker_args)[source]
Entry point for the LLM inference worker process.
- Parameters:
inference_worker_args (dict) – Arguments for initializing the InferenceWorker, including runtime parameters such as hostname, devices, and queues.
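As a rough illustration of what such a dict might contain, the sketch below assembles one keyed by some of the runtime parameter names from the __init__ signature. The key layout is an assumption (the description only says the dict carries hostname, devices, and queues), build_worker_args is a hypothetical helper, and the standard-library queue.Queue and threading.Event stand in for dragon.native.Queue and dragon.native.Event so the snippet runs without Dragon.

```python
import queue
import socket
import threading


def build_worker_args(devices, inf_wrkr_id):
    """Assemble a hypothetical inference_worker_args dict.

    Keys mirror runtime parameters of InferenceWorker.__init__; whether the
    real entry points expect exactly these keys is an assumption.
    """
    return {
        "hostname": socket.gethostname(),
        "devices": list(devices),
        "inf_wrkr_id": inf_wrkr_id,
        # Stand-ins for dragon.native.Queue and dragon.native.Event.
        "preprocessing_input_queue": queue.Queue(),
        "preprocessing_output_queue": queue.Queue(),
        "llm_proc_end_ev": threading.Event(),
        "inf_wrkr_down_ev": threading.Event(),
    }


args = build_worker_args(devices=[0, 1], inf_wrkr_id=0)
print(sorted(args))
```

Because both entry points are static methods that take the same dict, one such dict could in principle be handed to each of the two worker processes.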
- run_pre_processing_module()[source]
The pre-processing module performs batching and optional guardrails filtering.
Architecture: 1. Batching: Collect prompts into batches (DynamicBatcher) 2. Guardrails (optional): Filter malicious prompts (GuardrailsProcessor) 3. Forward to LLM: Send safe batches to LLM inference module
- Uses instance attributes set in __init__:
self.hostname, self.devices, self.preprocessing_input_queue, self.preprocessing_output_queue, self.head_cpu_pid, self.inf_wrkr_barrier, self.llm_proc_end_ev, self.inf_wrkr_id
- process_single_prompts(guardrails)[source]
Process individual prompts without batching (batch_size=1).
Architecture: 1. Read single prompt from input queue 2. Optionally filter through GuardrailsProcessor 3. Forward to LLM module immediately (no batching)
- Parameters:
guardrails (GuardrailsProcessor or None) – Optional GuardrailsProcessor instance.
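The three steps above can be sketched with standard-library stand-ins. This is a minimal sketch, not the library's implementation: forward_single_prompts and the is_safe callable are hypothetical, queue.Queue replaces dragon.native.Queue, and the loop drains the queue and returns rather than running until a shutdown event.

```python
import queue


def forward_single_prompts(input_q, output_q, is_safe=None):
    """Read prompts one at a time and forward each immediately.

    is_safe is a hypothetical stand-in for a GuardrailsProcessor check;
    None means guardrails are disabled.
    """
    forwarded = 0
    while True:
        try:
            prompt = input_q.get_nowait()
        except queue.Empty:
            break  # nothing left to read in this demo
        if is_safe is not None and not is_safe(prompt):
            continue  # drop prompts the guardrail rejects
        output_q.put([prompt])  # forward as a batch of size 1
        forwarded += 1
    return forwarded


inp, out = queue.Queue(), queue.Queue()
for text in ["hello", "BLOCKED text", "world"]:
    inp.put(text)
count = forward_single_prompts(inp, out, is_safe=lambda p: "BLOCKED" not in p)
```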
- process_prebatched(guardrails)[source]
Process pre-batched inputs with optional guardrails filtering.
Architecture: 1. Receive already-batched inputs 2. Optionally filter through GuardrailsProcessor 3. Forward to LLM module
- Parameters:
guardrails (GuardrailsProcessor or None) – Optional GuardrailsProcessor instance.
- process_with_batching(batcher, guardrails)[source]
Process inputs with dynamic batching and optional guardrails filtering.
Architecture: 1. Read from input queue 2. Add to DynamicBatcher, which produces a Batch when ready 3. Optionally filter Batch through GuardrailsProcessor 4. Forward to LLM module
All request types (text and chat) flow through the single DynamicBatcher. Per-request fields (tools, json_schema_override, continue_final_message) are carried along as optional BatchItem attributes.
- Parameters:
batcher (DynamicBatcher) – DynamicBatcher instance.
guardrails (GuardrailsProcessor or None) – Optional GuardrailsProcessor instance.
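To make the batching step concrete, here is a toy batcher that carries the per-request fields as optional attributes. It is a sketch under stated assumptions: ToyItem and ToyBatcher are hypothetical names, and the readiness policy (emit when max_size items are pending or max_wait seconds have elapsed) is assumed, since the DynamicBatcher policy is not described here.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyItem:
    """Hypothetical stand-in for a BatchItem; optional fields default to None."""
    prompt: str
    tools: Optional[list] = None
    json_schema_override: Optional[dict] = None
    continue_final_message: Optional[bool] = None


class ToyBatcher:
    """Emit a batch when max_size items are pending or max_wait seconds passed."""

    def __init__(self, max_size, max_wait):
        self.max_size = max_size
        self.max_wait = max_wait
        self._pending = []
        self._first_arrival = None

    def add(self, item, now=None):
        """Add one item; return a ready batch (list) or None."""
        now = time.monotonic() if now is None else now
        if not self._pending:
            self._first_arrival = now  # the wait clock starts with the first item
        self._pending.append(item)
        return self._take_if_ready(now)

    def poll(self, now=None):
        """Return a timed-out partial batch, or None."""
        now = time.monotonic() if now is None else now
        return self._take_if_ready(now)

    def _take_if_ready(self, now):
        if not self._pending:
            return None
        full = len(self._pending) >= self.max_size
        stale = now - self._first_arrival >= self.max_wait
        if full or stale:
            batch, self._pending = self._pending, []
            return batch
        return None


batcher = ToyBatcher(max_size=2, max_wait=0.5)
first = batcher.add(ToyItem("a"), now=0.0)             # not ready yet
second = batcher.add(ToyItem("b", tools=[]), now=0.1)  # size limit reached
third = batcher.add(ToyItem("c"), now=0.2)             # starts a new batch
late = batcher.poll(now=1.0)                           # wait limit reached
```

Passing explicit now values keeps the example deterministic; a real loop would rely on the monotonic clock default.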
- filter_with_guardrails(formatted_prompts: list, user_prompts: list, response_queues: list, latency_metrics: list, guardrails: GuardrailsProcessor | None)[source]
Apply guardrails filtering and return safe prompts and metrics.
- Parameters:
formatted_prompts (list) – List of formatted prompts for the LLM.
user_prompts (list) – List of original user prompts.
response_queues (list) – List of response queues for each prompt.
latency_metrics (list) – List of latency metrics for each prompt.
guardrails (GuardrailsProcessor or None) – GuardrailsProcessor instance, or None if guardrails are disabled.
- Returns:
Tuple (safe_formatted_prompts, safe_user_prompts, safe_response_queues, safe_latency_metrics, preprocessing_time, malicious_indices).
- Return type:
tuple
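The shape of the returned tuple can be illustrated by filtering four parallel lists by index. This is a minimal sketch, not the method's actual code: filter_parallel_lists and the is_malicious callable are hypothetical, and checking the original user prompt (rather than the formatted one) is an assumption. How rejected prompts are answered on their response queues is not modeled.

```python
import time


def filter_parallel_lists(formatted_prompts, user_prompts, response_queues,
                          latency_metrics, is_malicious=None):
    """Drop flagged entries from four parallel lists.

    is_malicious is a hypothetical callable standing in for a
    GuardrailsProcessor; None means guardrails are disabled.
    """
    start = time.perf_counter()
    if is_malicious is None:
        malicious_indices = []
    else:
        malicious_indices = [
            i for i, prompt in enumerate(user_prompts) if is_malicious(prompt)
        ]
    blocked = set(malicious_indices)
    keep = [i for i in range(len(user_prompts)) if i not in blocked]
    preprocessing_time = time.perf_counter() - start
    return (
        [formatted_prompts[i] for i in keep],
        [user_prompts[i] for i in keep],
        [response_queues[i] for i in keep],
        [latency_metrics[i] for i in keep],
        preprocessing_time,
        malicious_indices,
    )


result = filter_parallel_lists(
    ["<s>hi", "<s>bad", "<s>bye"],
    ["hi", "bad", "bye"],
    ["q0", "q1", "q2"],
    [{}, {}, {}],
    is_malicious=lambda prompt: prompt == "bad",
)
```

Filtering by a shared index list keeps the four outputs aligned, so position i in each safe list still refers to the same request.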
- run_llm_inference_module()[source]
The LLM inference module orchestrates GPUs in a tensor-parallel environment to perform batch inference using vLLM.
- Uses instance attributes set in __init__:
self.hostname, self.head_cpu_pid, self.devices, self.preprocessing_input_queue (used as read_from_queue), self.inf_wrkr_barrier, self.llm_proc_end_ev, self.inf_wrkr_down_ev, self.inf_wrkr_manager_q, self.inf_wrkr_id
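One plausible shape for a module that reads from a queue until an end event is set is a polling loop with a short timeout. The sketch below is an assumption about that pattern, not the module's actual control flow: consume_until_stopped and handle are hypothetical, and threading.Event, threading.Thread, and queue.Queue stand in for the Dragon event, process, and queue objects.

```python
import queue
import threading


def consume_until_stopped(read_from_queue, stop_event, handle, poll_interval=0.05):
    """Handle batches from read_from_queue until stop_event is set."""
    while not stop_event.is_set():
        try:
            batch = read_from_queue.get(timeout=poll_interval)
        except queue.Empty:
            continue  # nothing arrived; re-check the stop event
        handle(batch)


q = queue.Queue()
stop = threading.Event()       # plays the role of llm_proc_end_ev
processed = threading.Event()  # lets the demo wait for the first batch
seen = []


def handle(batch):
    seen.append(batch)
    processed.set()


worker = threading.Thread(target=consume_until_stopped, args=(q, stop, handle))
worker.start()
q.put(["prompt 1", "prompt 2"])
processed.wait(timeout=5)
stop.set()
worker.join(timeout=5)
```

The timeout on get is what lets the loop notice the event promptly; a blocking get with no timeout would never re-check it while the queue is empty.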