dragon.ai.inference.prompt_guard_utils.PromptGuard

class PromptGuard

Bases: object

Utilities for loading the PromptGuard model and evaluating text for jailbreaks and indirect injections.

Note that the underlying model is a DeBERTa model with a recommended maximum input size of 512 tokens. The final two methods of this class, get_jailbreak_scores_for_texts and get_indirect_injection_scores_for_texts, implement efficient parallel batched evaluation of the model on a list of input strings of arbitrary length. Each input is split into chunks, and its final score is the maximum score across all of its chunks.

__init__(model: str, hf_token: str) → None

Initialize the PromptGuard model wrapper.

Parameters:
  • model (str) – Name or path of the HuggingFace PromptGuard model.

  • hf_token (str) – HuggingFace token.

Methods

__init__(model, hf_token)

Initialize the PromptGuard model wrapper.

get_class_probabilities(text[, temperature, ...])

Evaluate the model on the given text with temperature-adjusted softmax.

get_indirect_injection_score(text[, ...])

Evaluate the probability that text contains embedded instructions.

get_indirect_injection_scores_for_texts(texts)

Compute indirect injection scores for a list of texts.

get_jailbreak_score(text[, temperature, ...])

Evaluate the probability that a string contains a jailbreak.

get_jailbreak_scores_for_texts(texts[, ...])

Compute jailbreak scores for a list of texts.

get_scores_for_texts(texts, score_indices[, ...])

Compute scores for a list of texts.

load_model_and_tokenizer(model_name, hf_token)

Load the PromptGuard model from Hugging Face or a local model.

preprocess_text_for_promptguard(text)

Preprocess the text by removing spaces that break apart larger tokens.

process_text_batch(texts[, temperature, ...])

Process a batch of texts and return their class probabilities.

load_model_and_tokenizer(model_name: str, hf_token: str)

Load the PromptGuard model from Hugging Face or a local model.

Parameters:
  • model_name (str) – Name or path of the HuggingFace PromptGuard model.

  • hf_token (str) – HuggingFace token.

Returns:

Tuple (model, tokenizer).

Return type:

tuple

preprocess_text_for_promptguard(text: str) → str

Preprocess the text by removing spaces that break apart larger tokens. This is a hotfix for a known PromptGuard bypass in which spaces are inserted into a string so that it is classified as benign.

Parameters:

text (str) – Input text to preprocess.

Returns:

Preprocessed text.

Return type:

str
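
The idea of removing spaces that break apart larger tokens can be sketched as follows. This is a minimal illustration, not the library's implementation: the greedy toy_tokenize function and the small vocab set are hypothetical stand-ins for the real model tokenizer. Whitespace is stripped, the remaining characters are re-tokenized, and a space is restored only where a token boundary coincides with whitespace in the original text.

```python
from typing import List, Set


def toy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match tokenizer (hypothetical stand-in for the real one).

    Characters that match no vocabulary entry become single-character tokens.
    """
    max_len = max((len(v) for v in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_token_splitting_spaces(text: str, vocab: Set[str]) -> str:
    """Drop whitespace that falls inside a token; keep it between tokens."""
    # Strip all whitespace, remembering where each kept character came from.
    kept_chars, index_map = [], []
    for i, ch in enumerate(text):
        if not ch.isspace():
            kept_chars.append(ch)
            index_map.append(i)
    cleaned = "".join(kept_chars)

    result, pos = [], 0
    for token in toy_tokenize(cleaned, vocab):
        original_start = index_map[pos]
        # Restore a space only if the original text had one right before
        # the first character of this token.
        if original_start > 0 and text[original_start - 1].isspace():
            result.append(" ")
        result.append(token)
        pos += len(token)
    return "".join(result)


vocab = {"ignore", "previous", "instructions"}
cleaned_example = remove_token_splitting_spaces(
    "ig nore prev ious instructions", vocab
)
```

With this toy vocabulary, the spaces inside "ig nore" and "prev ious" are removed while the spaces between words are preserved.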

get_class_probabilities(text: str, temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)

Evaluate the model on the given text with temperature-adjusted softmax.

Because the underlying model is a DeBERTa model, the input text should be at most 512 tokens long.

Parameters:
  • text (str) – Input text to classify.

  • temperature (float) – Temperature for the softmax function.

  • device (str) – Device on which to evaluate the model.

  • preprocess (bool) – Whether to run input-length preprocessing.

Returns:

Probability of each class adjusted by the temperature.

Return type:

torch.Tensor
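
A temperature-adjusted softmax divides the logits by the temperature before normalizing. The sketch below shows the arithmetic in plain Python rather than torch; the function name softmax_with_temperature and the example logits are illustrative assumptions, not part of this class.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]  # illustrative values only
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=3.0)
```

Temperatures below 1.0 concentrate probability on the largest logit, while temperatures above 1.0 spread it more evenly across classes.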

get_jailbreak_score(text: str, temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)

Evaluate the probability that a string contains a jailbreak.

This is suitable for filtering dialogue between a user and an LLM.

Parameters:
  • text (str) – Input text to evaluate.

  • temperature (float) – Temperature for the softmax function.

  • device (str) – Device on which to evaluate the model.

  • preprocess (bool) – Whether to run input-length preprocessing.

Returns:

Tuple (score, elapsed_time) where score is the probability of malicious content and elapsed_time is the time taken to compute it.

Return type:

tuple[float, float]
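
The (score, elapsed_time) return shape can be illustrated with a small timing wrapper. This is only a sketch under assumptions: timed_score and fake_score are hypothetical names, and fake_score is a keyword check standing in for the real model call.

```python
import time
from typing import Callable, Tuple


def timed_score(score_fn: Callable[[str], float], text: str) -> Tuple[float, float]:
    """Call score_fn on text and return (score, elapsed_seconds)."""
    start = time.perf_counter()
    score = score_fn(text)
    elapsed = time.perf_counter() - start
    return score, elapsed


def fake_score(text: str) -> float:
    """Hypothetical stand-in for a model call; not a real classifier."""
    return 0.9 if "ignore previous instructions" in text.lower() else 0.1


score, elapsed = timed_score(fake_score, "Please ignore previous instructions.")
```

Callers that only need the score can unpack the tuple and discard the second element.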

get_indirect_injection_score(text: str, temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)

Evaluate the probability that text contains embedded instructions.

This includes both malicious and benign instructions and is intended for filtering third-party inputs (for example, web searches or tool outputs) into an LLM.

Parameters:
  • text (str) – Input text to evaluate.

  • temperature (float) – Temperature for the softmax function.

  • device (str) – Device on which to evaluate the model.

  • preprocess (bool) – Whether to run input-length preprocessing.

Returns:

Tuple (score, elapsed_time) where score is the combined probability of embedded instructions and elapsed_time is the time taken to compute it.

Return type:

tuple[float, float]
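
A combined probability can be read as the sum of the probabilities of several classes. The sketch below assumes a three-class output ordered as benign, injection, jailbreak; that ordering, the constant names, and the example probabilities are assumptions made for illustration and are not confirmed by this page.

```python
from typing import List, Sequence

# Assumed class order (not confirmed by this documentation).
BENIGN, INJECTION, JAILBREAK = 0, 1, 2


def combined_score(probabilities: Sequence[float], score_indices: List[int]) -> float:
    """Sum the probabilities of the selected classes."""
    return sum(probabilities[i] for i in score_indices)


probs = [0.70, 0.25, 0.05]  # illustrative class probabilities
jailbreak_only = combined_score(probs, [JAILBREAK])
embedded_instructions = combined_score(probs, [INJECTION, JAILBREAK])
```

Under this assumed ordering, a score covering all embedded instructions is never lower than the jailbreak score alone, because it sums an additional non-negative probability.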

process_text_batch(texts: List[str], temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)

Process a batch of texts and return their class probabilities.

Parameters:
  • texts (list[str]) – List of texts to process.

  • temperature (float) – Temperature for the softmax function.

  • device (str) – Device on which to evaluate the model.

  • preprocess (bool) – Whether to run input-length preprocessing.

Returns:

Tensor containing the class probabilities for each text in the batch.

Return type:

torch.Tensor

get_scores_for_texts(texts: List[str], score_indices: List[int], temperature: float = 1.0, device: str = 'cpu', max_batch_size: int = 16, preprocess: bool = True)

Compute scores for a list of texts.

Texts of arbitrary length are broken into chunks and processed in parallel, with the final score for each text being the maximum across chunks.

Parameters:
  • texts (list[str]) – List of texts to evaluate.

  • score_indices (list[int]) – Indices of classes whose scores are summed for the final score calculation.

  • temperature (float) – Temperature for the softmax function.

  • device (str) – Device on which to evaluate the model.

  • max_batch_size (int) – Maximum number of text chunks to process in a single batch.

  • preprocess (bool) – Whether to run input-length preprocessing.

Returns:

List of scores for each text.

Return type:

list[float]
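
The chunk, batch, and take-the-maximum flow described above can be sketched without the model. In this illustration the inputs are already tokenized lists, score_batch is a hypothetical callable standing in for batched model evaluation, and chunking by a fixed token count is an assumption about how the chunks are formed.

```python
from typing import Callable, List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 512) -> List[Sequence[str]]:
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return chunks or [tokens]  # an empty text still yields one (empty) chunk


def max_scores_over_chunks(
    token_lists: List[Sequence[str]],
    score_batch: Callable[[List[Sequence[str]]], List[float]],
    chunk_size: int = 512,
    max_batch_size: int = 16,
) -> List[float]:
    """Score every chunk in batches and keep the maximum score per text."""
    chunks, owners = [], []
    for text_index, tokens in enumerate(token_lists):
        for chunk in chunk_tokens(tokens, chunk_size):
            chunks.append(chunk)
            owners.append(text_index)  # remember which text each chunk belongs to

    scores = [0.0] * len(token_lists)
    for start in range(0, len(chunks), max_batch_size):
        batch = chunks[start:start + max_batch_size]
        batch_owners = owners[start:start + max_batch_size]
        for owner, score in zip(batch_owners, score_batch(batch)):
            scores[owner] = max(scores[owner], score)
    return scores


def fake_score_batch(batch: List[Sequence[str]]) -> List[float]:
    """Hypothetical scorer: flags any chunk containing the token 'attack'."""
    return [1.0 if "attack" in chunk else 0.0 for chunk in batch]


token_lists = [["a"] * 5 + ["attack"], ["b"] * 4]
example_scores = max_scores_over_chunks(
    token_lists, fake_score_batch, chunk_size=2, max_batch_size=2
)
```

Taking the maximum means a single high-scoring chunk is enough to flag a long text, even when every other chunk looks benign.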

get_jailbreak_scores_for_texts(texts: List[str], temperature: float = 1.0, device: str = 'cpu', max_batch_size: int = 16, preprocess: bool = True)

Compute jailbreak scores for a list of texts.

Parameters:
  • texts (list[str]) – List of texts to evaluate.

  • temperature (float) – Temperature for the softmax function.

  • device (str) – Device on which to evaluate the model.

  • max_batch_size (int) – Maximum number of text chunks to process in a single batch.

  • preprocess (bool) – Whether to run input-length preprocessing.

Returns:

Tuple (scores, elapsed_time) where scores is the list of jailbreak scores and elapsed_time is the total time taken.

Return type:

tuple[list[float], float]

get_indirect_injection_scores_for_texts(texts: List[str], temperature: float = 1.0, device: str = 'cpu', max_batch_size: int = 16, preprocess: bool = True)

Compute indirect injection scores for a list of texts.

Parameters:
  • texts (list[str]) – List of texts to evaluate.

  • temperature (float) – Temperature for the softmax function.

  • device (str) – Device on which to evaluate the model.

  • max_batch_size (int) – Maximum number of text chunks to process in a single batch.

  • preprocess (bool) – Whether to run input-length preprocessing.

Returns:

Tuple (scores, elapsed_time) where scores is the list of indirect injection scores and elapsed_time is the total time taken.

Return type:

tuple[list[float], float]
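
Once a list of scores is available, one possible way to use it is to partition the inputs by a threshold. The sketch below is a usage illustration only: split_by_threshold, the 0.5 threshold, and the example scores are all hypothetical, and a suitable threshold would need to be tuned for the application.

```python
from typing import List, Tuple


def split_by_threshold(
    texts: List[str], scores: List[float], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Return (allowed, flagged) texts, flagging scores at or above threshold."""
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    allowed = [t for t, s in zip(texts, scores) if s < threshold]
    flagged = [t for t, s in zip(texts, scores) if s >= threshold]
    return allowed, flagged


texts = ["search result A", "tool output B", "web page C"]
scores = [0.02, 0.91, 0.40]  # made-up scores for illustration
allowed, flagged = split_by_threshold(texts, scores, threshold=0.5)
```

Only the scores list from the returned tuple is needed here; the elapsed time can be logged separately.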