Jailbreak
Jailbreak rule for guardrails
The jailbreak rule is used to detect and prevent attempts to bypass or manipulate the LLM's intended behavior or restrictions. This rule helps maintain the integrity of the AI system by identifying potential exploits or malicious prompts that could lead to undesired or harmful responses.
Parameters:
- type: jailbreak (specifies the rule type)
- expected: Usually set to fail to flag when a jailbreak attempt is detected
- threshold: A number between 0 and 1 representing the confidence level for jailbreak detection (e.g., 0.9 for 90% confidence)
For a practical example of how to implement this rule in a policy, see our Prevent Jailbreak Attempts policy example.
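To make the parameter shapes concrete, here is a minimal sketch of a rule definition written as a plain Python dictionary. It is an illustration only: the type, expected, and threshold fields come from the list above, while the surrounding policy structure and its name are assumptions, not the product's exact schema.

```python
# Minimal sketch of a jailbreak guardrail rule, assuming a dictionary-based
# policy layout. Only type, expected, and threshold are documented above;
# the "policy" wrapper and its name are hypothetical.
jailbreak_rule = {
    "type": "jailbreak",   # the rule type
    "expected": "fail",    # flag the interaction when a jailbreak attempt is detected
    "threshold": 0.9,      # detection confidence cutoff (90%)
}

policy = {
    "name": "prevent-jailbreak-attempts",  # hypothetical policy name
    "rules": [jailbreak_rule],
}
```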
Model revisions
You can specify the model revision by passing the model_revision in the rule, e.g., jailbreak@47ffb2e. If not specified, the default revision for the model is used.
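Building on the sketch above, pinning a revision could look like the following; the model_revision field and the jailbreak@47ffb2e value come from this section, and the rest remains an assumption.

```python
# Hypothetical sketch: pinning the jailbreak rule to a specific model revision.
# Omitting model_revision falls back to the model's default revision.
pinned_jailbreak_rule = {
    "type": "jailbreak",
    "expected": "fail",
    "threshold": 0.9,
    "model_revision": "jailbreak@47ffb2e",  # revision documented above
}
```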
Supported models
| Rule Type | Latest Model Revision | Language support | Task Description | Metric and model card report |
|---|---|---|---|---|
| Jailbreak | jailbreak@47ffb2e | en, pt, es | Detects attempts to bypass AI safety measures | JailbreakGuard-10K |
Revisions
| Model | Revision | Benchmark |
|---|---|---|
| Jailbreak | jailbreak@47ffb2e (latest) | f1: 0.95 @ JailbreakGuard-10K |
Model metrics
Jailbreak@47ffb2e
| Metric | Score | Dataset |
|---|---|---|
| Accuracy | 0.95 | JailbreakGuard-10K |
| Precision | 0.93 | JailbreakGuard-10K |
| Recall | 0.97 | JailbreakGuard-10K |
| F1 Score | 0.95 | JailbreakGuard-10K |
| AUC-ROC | 0.98 | JailbreakGuard-10K |
| False Positive Rate | 0.03 | JailbreakGuard-10K |
| True Negative Rate | 0.97 | JailbreakGuard-10K |
The jailbreak model was benchmarked using the JailbreakGuard-10K dataset, which consists of 10,000 carefully curated examples of potential jailbreak attempts and safe interactions in English, Portuguese, and Spanish. The dataset covers a wide range of advanced jailbreak techniques, so the reported metrics reflect performance across varied scenarios.
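As a quick sanity check on the table above, the snippet below recomputes the F1 score as the harmonic mean of the reported precision and recall; it is purely illustrative.

```python
# Recompute F1 from the precision and recall reported for jailbreak@47ffb2e.
precision = 0.93
recall = 0.97

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.95, matching the F1 Score row above
```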