Jailbreak
Jailbreak rule for guardrails
The jailbreak rule is used to detect and prevent attempts to bypass or manipulate the LLM's intended behavior or restrictions. This rule helps maintain the integrity of the AI system by identifying potential exploits or malicious prompts that could lead to undesired or harmful responses.
Parameters:
- type: jailbreak (specifies the rule type)
- expected: Usually set to fail to flag when a jailbreak attempt is detected
- threshold: A number between 0 and 1 representing the confidence level for jailbreak detection (e.g., 0.9 for 90% confidence)
For a practical example of how to implement this rule in a policy, see our Prevent Jailbreak Attempts policy example.
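To make the parameter shapes concrete, here is a minimal sketch of a rule definition written as a plain Python dictionary. It is an illustration only: the type, expected, and threshold fields come from the list above, while the surrounding policy structure and its name are assumptions, not the product's exact schema.

```python
# Minimal sketch of a jailbreak guardrail rule, assuming a dictionary-based
# policy layout. Only type, expected, and threshold are documented above;
# the "policy" wrapper and its name are hypothetical.
jailbreak_rule = {
    "type": "jailbreak",   # the rule type
    "expected": "fail",    # flag the interaction when a jailbreak attempt is detected
    "threshold": 0.9,      # detection confidence cutoff (90%)
}

policy = {
    "name": "prevent-jailbreak-attempts",  # hypothetical policy name
    "rules": [jailbreak_rule],
}
```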
Model revisions
You can specify the model revision by passing the model_revision in the rule, e.g., jailbreak@47ffb2e. If not specified, the default revision for the model is used.
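Building on the sketch above, pinning a revision could look like the following; the model_revision field and the jailbreak@47ffb2e value come from this section, and the rest remains an assumption.

```python
# Hypothetical sketch: pinning the jailbreak rule to a specific model revision.
# Omitting model_revision falls back to the model's default revision.
pinned_jailbreak_rule = {
    "type": "jailbreak",
    "expected": "fail",
    "threshold": 0.9,
    "model_revision": "jailbreak@47ffb2e",  # revision documented above
}
```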
Supported models
| Rule Type | Latest Model Revision | Language support | Task Description | Metric and model card report |
|---|---|---|---|---|
| Jailbreak | jailbreak@47ffb2e | en, pt, es | Detects attempts to bypass AI safety measures | JailbreakGuard-10K |
Revisions
| Model | Revision | Benchmark |
|---|---|---|
| Jailbreak | jailbreak@47ffb2e (latest) | f1: 0.95 @ JailbreakGuard-10K |
Model metrics
Jailbreak@47ffb2e
| Metric | Score | Dataset |
|---|---|---|
| Accuracy | 0.95 | JailbreakGuard-10K |
| Precision | 0.93 | JailbreakGuard-10K |
| Recall | 0.97 | JailbreakGuard-10K |
| F1 Score | 0.95 | JailbreakGuard-10K |
| AUC-ROC | 0.98 | JailbreakGuard-10K |
| False Positive Rate | 0.03 | JailbreakGuard-10K |
| True Negative Rate | 0.97 | JailbreakGuard-10K |
The jailbreak model was benchmarked using the JailbreakGuard-10K dataset, which consists of 10,000 carefully curated examples of potential jailbreak attempts and safe interactions in English, Portuguese, and Spanish. The dataset covers a wide range of advanced jailbreak techniques, so the reported metrics reflect performance across varied scenarios.
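As a quick sanity check on the table above, the snippet below recomputes the F1 score as the harmonic mean of the reported precision and recall; it is purely illustrative.

```python
# Recompute F1 from the precision and recall reported for jailbreak@47ffb2e.
precision = 0.93
recall = 0.97

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.95, matching the F1 Score row above
```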