dragon.ai.inference.prompt_guard_utils.PromptGuard

class PromptGuard

Bases: object

Evaluate text with a PromptGuard jailbreak/injection classifier.

PromptGuard is a DeBERTa-based classifier with a recommended maximum input length of 512 tokens. Single-text methods score one truncated input. The batched scoring methods split longer strings into chunks and aggregate by taking the maximum score per original input, so a high-risk chunk causes the full input to be treated as high risk.

The preprocessing step removes adversarial whitespace that may otherwise split meaningful tokens and reduce the classifier score.

__init__(model: str, hf_token: str) → None

Initialize the PromptGuard model wrapper.

Parameters:

model (str) – Name or path of the Hugging Face PromptGuard model.
hf_token (str) – Hugging Face access token.

Methods

`__init__`(model, hf_token)	Initialize the PromptGuard model wrapper.
`get_class_probabilities`(text[, temperature, ...])	Evaluate the model on the given text with temperature-adjusted softmax.
`get_indirect_injection_score`(text[, ...])	Evaluate the probability that text contains embedded instructions.
`get_indirect_injection_scores_for_texts`(texts)	Compute indirect injection scores for a list of texts.
`get_jailbreak_score`(text[, temperature, ...])	Evaluate the probability that a string contains a jailbreak.
`get_jailbreak_scores_for_texts`(texts[, ...])	Compute jailbreak scores for a list of texts.
`get_scores_for_texts`(texts, score_indices[, ...])	Compute scores for a list of texts.
`load_model_and_tokenizer`(model_name, hf_token)	Load the PromptGuard model from Hugging Face or a local model.
`preprocess_text_for_promptguard`(text)	Preprocess the text by removing spaces that break apart larger tokens.
`process_text_batch`(texts[, temperature, ...])	Process a batch of texts and return their class probabilities.

load_model_and_tokenizer(model_name: str, hf_token: str)

Load the PromptGuard model from Hugging Face or a local model.

Parameters:

model_name (str) – Name or path of the Hugging Face PromptGuard model.
hf_token (str) – Hugging Face access token.

Returns:

Tuple (model, tokenizer).

Return type:

tuple

preprocess_text_for_promptguard(text: str) → str

Preprocess the text by removing spaces that break apart larger tokens. This is a hotfix for a known PromptGuard bypass, in which spaces inserted into a string cause it to be classified as benign.

Parameters:

text (str) – Input text to preprocess.

Returns:

Preprocessed text.

Return type:

str
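
The idea can be sketched as follows. This is a minimal illustration, not the library's implementation: it strips all whitespace, re-tokenizes the result, and restores a space only where a token begins right after whitespace in the original text. The names `toy_tokenize` and `collapse_token_splitting_spaces`, and the tiny vocabulary, are hypothetical stand-ins for the model's real tokenizer.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match tokenizer over a tiny vocabulary.

    Hypothetical stand-in for the real subword tokenizer.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def collapse_token_splitting_spaces(text, vocab):
    """Remove whitespace that falls inside a token; keep it between tokens."""
    # Drop all whitespace, remembering each kept character's original index.
    kept = [(i, ch) for i, ch in enumerate(text) if not ch.isspace()]
    cleaned = "".join(ch for _, ch in kept)

    out = []
    pos = 0  # position of the current token's first character in `cleaned`
    for token in toy_tokenize(cleaned, vocab):
        original_index = kept[pos][0]
        # Restore a space only if the token started right after whitespace.
        if original_index > 0 and text[original_index - 1].isspace():
            out.append(" ")
        out.append(token)
        pos += len(token)
    return "".join(out)


vocab = {"ignore", "previous", "instructions"}
print(collapse_token_splitting_spaces("ig nore prev ious instructions", vocab))
# prints: ignore previous instructions
```

Spaces that sit between two tokens survive, so ordinary text is left largely unchanged; only spaces that split a token are removed.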

get_class_probabilities(text: str, temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)

Evaluate the model on the given text with temperature-adjusted softmax.

Because PromptGuard is a DeBERTa-based model, the input text should be at most 512 tokens long.

Parameters:

text (str) – Input text to classify.
temperature (float) – Temperature for the softmax function.
device (str) – Device on which to evaluate the model.
preprocess (bool) – Whether to apply preprocess_text_for_promptguard to the text before scoring.

Returns:

Probability of each class adjusted by the temperature.

Return type:

torch.Tensor
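
As an illustration of what a temperature-adjusted softmax usually means, the sketch below divides the logits by the temperature before normalizing. It uses plain Python lists rather than tensors so it runs without extra dependencies; `temperature_softmax` is a hypothetical helper, and the assumption that the method scales logits this way is not confirmed by this page.

```python
import math


def temperature_softmax(logits, temperature=1.0):
    """Softmax over logits / temperature (assumed form of the adjustment)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]  # made-up logits for three classes
sharp = temperature_softmax(logits, temperature=1.0)
flat = temperature_softmax(logits, temperature=3.0)
```

Under this form, a temperature above 1 flattens the distribution (`max(flat) < max(sharp)`), and a temperature below 1 sharpens it.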

get_jailbreak_score(text: str, temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)

Evaluate the probability that a string contains a jailbreak.

This is suitable for filtering dialogue between a user and an LLM.

Parameters:

text (str) – Input text to evaluate.
temperature (float) – Temperature for the softmax function.
device (str) – Device on which to evaluate the model.
preprocess (bool) – Whether to apply preprocess_text_for_promptguard to the text before scoring.

Returns:

Tuple (score, elapsed_time) where score is the probability of malicious content and elapsed_time is the time taken to compute it.

Return type:

tuple[float, float]

get_indirect_injection_score(text: str, temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)

Evaluate the probability that text contains embedded instructions.

This includes both malicious and benign instructions and is intended for filtering third-party inputs (for example, web searches or tool outputs) into an LLM.

Parameters:

text (str) – Input text to evaluate.
temperature (float) – Temperature for the softmax function.
device (str) – Device on which to evaluate the model.
preprocess (bool) – Whether to apply preprocess_text_for_promptguard to the text before scoring.

Returns:

Tuple (score, elapsed_time) where score is the combined probability of embedded instructions and elapsed_time is the time taken to compute it.

Return type:

tuple[float, float]
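
A combined score of this kind can be pictured as the sum of the probabilities of several classes. The sketch below is illustrative only: the label order (benign, injection, jailbreak) is an assumption that this page does not state, and `sum_class_probabilities` is a hypothetical helper, so check the loaded model's label mapping before relying on specific indices.

```python
# Assumed label order; not confirmed by this page.
BENIGN, INJECTION, JAILBREAK = 0, 1, 2


def sum_class_probabilities(class_probabilities, score_indices):
    """Sum the probabilities of the selected classes."""
    return sum(class_probabilities[i] for i in score_indices)


probs = [0.7, 0.2, 0.1]  # made-up class probabilities for one text

# A jailbreak-style score looks at a single class.
jailbreak_like = sum_class_probabilities(probs, [JAILBREAK])

# An embedded-instruction score combines more than one class.
instruction_like = sum_class_probabilities(probs, [INJECTION, JAILBREAK])
```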

process_text_batch(texts: List[str], temperature: float = 1.0, device: str = 'cpu', preprocess: bool = True)

Process a batch of texts and return their class probabilities.

Parameters:

texts (list[str]) – List of texts to process.
temperature (float) – Temperature for the softmax function.
device (str) – Device on which to evaluate the model.
preprocess (bool) – Whether to apply preprocess_text_for_promptguard to each text before scoring.

Returns:

Tensor containing the class probabilities for each text in the batch.

Return type:

torch.Tensor

get_scores_for_texts(texts: List[str], score_indices: List[int], temperature: float = 1.0, device: str = 'cpu', max_batch_size: int = 16, preprocess: bool = True)

Compute scores for a list of texts.

Texts of arbitrary length are broken into chunks and processed in parallel, with the final score for each text being the maximum across chunks.

Parameters:

texts (list[str]) – List of texts to evaluate.
score_indices (list[int]) – Indices of the classes whose probabilities are summed to produce the final score.
temperature (float) – Temperature for the softmax function.
device (str) – Device on which to evaluate the model.
max_batch_size (int) – Maximum number of text chunks to process in a single batch.
preprocess (bool) – Whether to apply preprocess_text_for_promptguard to each text before scoring.

Returns:

List of scores for each text.

Return type:

list[float]
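
The chunk-then-maximum aggregation described above can be sketched without the model. In this illustration, `max_chunk_scores` and the `score_chunk_batch` callback are hypothetical, and splitting by a fixed number of characters is an assumption; the page does not say how chunk boundaries are chosen.

```python
def max_chunk_scores(texts, score_chunk_batch, chunk_size=512, max_batch_size=16):
    """Score each text as the maximum score over its chunks.

    `score_chunk_batch` takes a list of chunks and returns one score per chunk.
    """
    chunks = []
    owners = []  # owners[k] is the index of the text that chunk k came from
    for text_index, text in enumerate(texts):
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        # An empty text still contributes one chunk so that it receives a score.
        pieces = pieces or [""]
        chunks.extend(pieces)
        owners.extend([text_index] * len(pieces))

    scores = [0.0] * len(texts)
    for start in range(0, len(chunks), max_batch_size):
        batch = chunks[start:start + max_batch_size]
        batch_owners = owners[start:start + max_batch_size]
        for owner, score in zip(batch_owners, score_chunk_batch(batch)):
            # One high-risk chunk is enough to mark the whole text high risk.
            scores[owner] = max(scores[owner], score)
    return scores


def keyword_scorer(batch):
    """Toy scorer standing in for the classifier."""
    return [0.9 if "attack" in chunk else 0.1 for chunk in batch]


texts = ["a" * 600 + "attack", "hello"]
print(max_chunk_scores(texts, keyword_scorer))
# prints: [0.9, 0.1]
```

Taking the maximum rather than the mean means that a long, mostly benign document cannot dilute a single risky passage.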

get_jailbreak_scores_for_texts(texts: List[str], temperature: float = 1.0, device: str = 'cpu', max_batch_size: int = 16, preprocess: bool = True)

Compute jailbreak scores for a list of texts.

Parameters:

texts (list[str]) – List of texts to evaluate.
temperature (float) – Temperature for the softmax function.
device (str) – Device on which to evaluate the model.
max_batch_size (int) – Maximum number of text chunks to process in a single batch.
preprocess (bool) – Whether to apply preprocess_text_for_promptguard to each text before scoring.

Returns:

Tuple (scores, elapsed_time) where scores is the list of jailbreak scores and elapsed_time is the total time taken.

Return type:

tuple[list[float], float]

get_indirect_injection_scores_for_texts(texts: List[str], temperature: float = 1.0, device: str = 'cpu', max_batch_size: int = 16, preprocess: bool = True)

Compute indirect injection scores for a list of texts.

Parameters:

texts (list[str]) – List of texts to evaluate.
temperature (float) – Temperature for the softmax function.
device (str) – Device on which to evaluate the model.
max_batch_size (int) – Maximum number of text chunks to process in a single batch.
preprocess (bool) – Whether to apply preprocess_text_for_promptguard to each text before scoring.

Returns:

Tuple (scores, elapsed_time) where scores is the list of indirect injection scores and elapsed_time is the total time taken.

Return type:

tuple[list[float], float]