dragon.ai.inference.prompt_guard_utils.PromptGuard
- class PromptGuard
Bases: object
Utilities for loading the PromptGuard model and evaluating text for jailbreaks and indirect injections.
Because the underlying model is a DeBERTa model, its maximum recommended input size is 512 tokens. The final two functions in this file implement efficient, parallel, batched evaluation of the model on a list of input strings of arbitrary length; the final score for each input is the maximum score across all chunks of that input string.
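The chunk-then-maximum aggregation described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the names chunk_tokens and max_chunk_score are hypothetical, the input is assumed to be an already-tokenized sequence, and returning 0.0 for empty input is an assumption.

```python
from typing import Callable, List, Sequence


def chunk_tokens(tokens: Sequence[int], max_len: int = 512) -> List[Sequence[int]]:
    """Split a token sequence into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def max_chunk_score(
    tokens: Sequence[int],
    score_chunk: Callable[[Sequence[int]], float],
    max_len: int = 512,
) -> float:
    """Score each chunk independently and keep the highest score.

    score_chunk is a hypothetical stand-in for a model call that returns
    a probability for a single chunk.
    """
    chunks = chunk_tokens(tokens, max_len)
    if not chunks:
        return 0.0  # assumption: empty input is treated as benign
    return max(score_chunk(chunk) for chunk in chunks)
```

Taking the maximum means that a single suspicious chunk is enough to flag the whole input, however long the surrounding benign text is.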
Methods
__init__(model, hf_token) – Initialize the PromptGuard model wrapper.
get_class_probabilities(text[, temperature, ...]) – Evaluate the model on the given text with temperature-adjusted softmax.
get_indirect_injection_score(text[, ...]) – Evaluate the probability that text contains embedded instructions.
get_indirect_injection_scores_for_texts(texts[, ...]) – Compute indirect injection scores for a list of texts.
get_jailbreak_score(text[, temperature, ...]) – Evaluate the probability that a string contains a jailbreak.
get_jailbreak_scores_for_texts(texts[, ...]) – Compute jailbreak scores for a list of texts.
get_scores_for_texts(texts, score_indices[, ...]) – Compute scores for a list of texts.
load_model_and_tokenizer(model_name, hf_token) – Load the PromptGuard model from Hugging Face or a local model.
preprocess_text_for_promptguard(text) – Preprocess the text by removing spaces that break apart larger tokens.
process_text_batch(texts[, temperature, ...]) – Process a batch of texts and return their class probabilities.
- load_model_and_tokenizer(model_name: str, hf_token: str)
Load the PromptGuard model from Hugging Face or a local model.
- preprocess_text_for_promptguard(text: str) → str
Preprocess the text by removing spaces that break apart larger tokens. This is a hotfix for a known PromptGuard workaround in which spaces are inserted into a string so that it is classified as benign.
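One way to picture this preprocessing is the sketch below. It is only an illustration under stated assumptions: TOY_VOCAB and toy_tokenize are hypothetical stand-ins for the model's real tokenizer, and remove_token_splitting_spaces is not the method's actual implementation. The idea is to tokenize the text with all whitespace removed, then restore a space only where the original text had one directly before a token boundary.

```python
from typing import List

# Hypothetical toy vocabulary; a real tokenizer would supply its own.
TOY_VOCAB = {"ignore", "previous", "instructions"}


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenizer; unknown characters become 1-char tokens."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens


def remove_token_splitting_spaces(text: str) -> str:
    """Drop spaces that fall inside a token; keep spaces at token boundaries."""
    # Remember the original position of every non-whitespace character.
    kept = [(i, ch) for i, ch in enumerate(text) if not ch.isspace()]
    cleaned = "".join(ch for _, ch in kept)
    pieces = []
    pos = 0  # position of the current token's first character in `cleaned`
    for token in toy_tokenize(cleaned):
        orig_index = kept[pos][0]
        # Restore a space only if the original had one right before this token.
        if pos > 0 and text[orig_index - 1].isspace():
            pieces.append(" ")
        pieces.append(token)
        pos += len(token)
    return "".join(pieces)
```

With this toy vocabulary, an input such as "ig nore prev ious instruc tions" is rejoined into whole words before classification, while ordinary spaces between words are preserved.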
- get_class_probabilities(text: str, temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)
Evaluate the model on the given text with temperature-adjusted softmax.
Note that, as this is a DeBERTa model, the input text should have a maximum length of 512 tokens.
- Parameters:
- Returns:
Probability of each class adjusted by the temperature.
- Return type:
torch.Tensor
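For readers unfamiliar with temperature-adjusted softmax, the computation can be sketched in plain Python as below. The function name softmax_with_temperature is hypothetical and operates on a list of logits rather than a torch.Tensor; it illustrates the formula only, not this method's implementation.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating to avoid overflow.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

A temperature above 1.0 flattens the distribution toward uniform, while a temperature below 1.0 sharpens it toward the highest-scoring class; 1.0 leaves the standard softmax unchanged.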
- get_jailbreak_score(text: str, temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)
Evaluate the probability that a string contains a jailbreak.
This is suitable for filtering dialogue between a user and an LLM.
- Parameters:
- Returns:
Tuple (score, elapsed_time) where score is the probability of malicious content and elapsed_time is the time taken to compute it.
- Return type:
- get_indirect_injection_score(text: str, temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)
Evaluate the probability that text contains embedded instructions.
This includes both malicious and benign instructions and is intended for filtering third-party inputs (for example, web searches or tool outputs) into an LLM.
- Parameters:
- Returns:
Tuple (score, elapsed_time) where score is the combined probability of embedded instructions and elapsed_time is the time taken to compute it.
- Return type:
- process_text_batch(texts: List[str], temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)
Process a batch of texts and return their class probabilities.
- Parameters:
- Returns:
Tensor containing the class probabilities for each text in the batch.
- Return type:
torch.Tensor
- get_scores_for_texts(texts: List[str], score_indices: List[int], temperature: float = 1.0, device: str = 'cpu', max_batch_size: int = 16, preprocess: bool = True)
Compute scores for a list of texts.
Texts of arbitrary length are broken into chunks and processed in parallel, with the final score for each text being the maximum across chunks.
- Parameters:
score_indices (list[int]) – Indices of classes whose scores are summed for the final score calculation.
temperature (float) – Temperature for the softmax function.
device (str) – Device on which to evaluate the model.
max_batch_size (int) – Maximum number of text chunks to process in a single batch.
preprocess (bool) – Whether to run input-length preprocessing.
- Returns:
List of scores for each text.
- Return type:
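The batching behaviour described for this method can be sketched as below, assuming the texts have already been split into chunks. The names aggregate_scores and classify_batch are hypothetical; classify_batch stands in for a model call that returns one list of class probabilities per chunk. Chunks from all texts are flattened, processed in batches of at most max_batch_size, and each chunk's score is the sum of the probabilities at score_indices.

```python
from typing import Callable, List, Sequence


def aggregate_scores(
    chunks_per_text: Sequence[Sequence[str]],
    classify_batch: Callable[[List[str]], List[List[float]]],
    score_indices: Sequence[int],
    max_batch_size: int = 16,
) -> List[float]:
    """Return one score per text: the maximum summed-class score over its chunks."""
    # Flatten all chunks, remembering which text each chunk belongs to.
    flat_chunks: List[str] = []
    owners: List[int] = []
    for text_index, chunks in enumerate(chunks_per_text):
        for chunk in chunks:
            flat_chunks.append(chunk)
            owners.append(text_index)

    scores = [0.0] * len(chunks_per_text)
    for start in range(0, len(flat_chunks), max_batch_size):
        batch = flat_chunks[start:start + max_batch_size]
        batch_probs = classify_batch(batch)  # hypothetical model call
        for offset, class_probs in enumerate(batch_probs):
            # Sum the probabilities of the selected classes for this chunk.
            chunk_score = sum(class_probs[i] for i in score_indices)
            owner = owners[start + offset]
            scores[owner] = max(scores[owner], chunk_score)
    return scores
```

Flattening before batching lets chunks from different texts share a batch, so many short texts do not each pay for a separate model call.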
- get_jailbreak_scores_for_texts(texts: List[str], temperature: float = 1.0, device: str = 'cpu', max_batch_size: int = 16, preprocess: bool = True)
Compute jailbreak scores for a list of texts.
- Parameters:
- Returns:
Tuple (scores, elapsed_time) where scores is the list of jailbreak scores and elapsed_time is the total time taken.
- Return type:
- get_indirect_injection_scores_for_texts(texts: List[str], temperature: float = 1.0, device: str = 'cpu', max_batch_size: int = 16, preprocess: bool = True)
Compute indirect injection scores for a list of texts.
- Parameters:
- Returns:
Tuple (scores, elapsed_time) where scores is the list of indirect injection scores and elapsed_time is the total time taken.
- Return type: