dragon.ai.inference.guardrails.GuardrailsProcessor

class GuardrailsProcessor[source]

Bases: object

Handles prompt safety checking using PromptGuard model.

This class is responsible ONLY for guardrails/safety checking, completely separated from batching and LLM inference logic.

__init__(config: GuardrailsConfig, hf_token: str )[source]

Initialize the guardrails processor.

Parameters:
  • config (GuardrailsConfig) – Guardrails configuration.

  • hf_token (str ) – HuggingFace token for model access.

Methods

__init__(config, hf_token)

Initialize the guardrails processor.

check_prompts(prompts)

Check a list of prompts for jailbreak attempts.

filter_batch(prompts, formatted_prompts, ...)

Filter a batch of prompts, separating safe from malicious ones.

get_malicious_response()

Get the standard response for malicious prompts.

__init__(config: GuardrailsConfig, hf_token: str )[source]

Initialize the guardrails processor.

Parameters:
  • config (GuardrailsConfig) – Guardrails configuration.

  • hf_token (str ) – HuggingFace token for model access.

check_prompts(prompts: List [str ]) Tuple [List [bool ], List [float ], float ][source]

Check a list of prompts for jailbreak attempts.

Parameters:

prompts (list [str ]) – List of user prompts to check.

Returns:

Tuple (is_safe, jailbreak_scores, processing_time) where is_safe is a list of booleans (True if safe, False if malicious), jailbreak_scores are the scores per prompt, and processing_time is the total processing time in seconds.

Return type:

tuple [list [bool ], list [float ], float ]

filter_batch(prompts: List [str ], formatted_prompts: List [str ], response_queues: List , latency_metrics: List [Tuple [float , float , float ]]) Tuple [List [str ], List [str ], List , List [Tuple [float , float , float ]], List [int ], float ][source]

Filter a batch of prompts, separating safe from malicious ones.

Parameters:
Returns:

Tuple (safe_prompts, safe_formatted, safe_queues, safe_metrics, malicious_indices, processing_time) where the safe_* lists contain only safe entries, malicious_indices are the indices of malicious prompts and processing_time is the guardrails processing time in seconds.

Return type:

tuple

get_malicious_response() str [source]

Get the standard response for malicious prompts.

Returns:

Standard response string for malicious prompts.

Return type:

str