jailbreak
The `jailbreak` rule detects and prevents attempts to bypass or manipulate the LLM's intended behavior or restrictions. It helps maintain the integrity of the AI system by identifying potential exploits or malicious prompts that could lead to undesired or harmful responses.
Parameters (see the configuration sketch after this list):
- type: `jailbreak` (specifies the rule type)
- expected: Usually set to `fail` to flag when a jailbreak attempt is detected
- threshold: A number between 0 and 1 representing the confidence level for jailbreak detection (e.g., 0.9 for 90% confidence)
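A minimal configuration sketch, assuming the rule is expressed as a plain dictionary of the parameters above; only the keys `type`, `expected`, and `threshold` come from this page, while the surrounding structure and the helper function are illustrative rather than part of a documented API.

```python
# Minimal sketch of a jailbreak rule, assuming rules are plain dictionaries.
# Only the keys type, expected, and threshold come from the parameters above;
# everything else here is illustrative.
jailbreak_rule = {
    "type": "jailbreak",   # the rule type
    "expected": "fail",    # flag the input when a jailbreak attempt is detected
    "threshold": 0.9,      # 90% detection confidence required to flag
}

def rule_passes(detection_score: float, rule: dict) -> bool:
    """Hypothetical helper: return False when the rule should flag the input."""
    detected = detection_score >= rule["threshold"]
    # With expected set to "fail", a detection at or above the threshold
    # means the check fails (the input is flagged).
    return not detected
```

For example, `rule_passes(0.95, jailbreak_rule)` returns `False`, meaning the prompt would be flagged as a jailbreak attempt.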
Model revisions
You can specify the model revision by passing `model_revision` in the rule, e.g., `jailbreak@47ffb2e`. If not specified, the default revision for the model is used.
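As a hedged example, pinning the rule to a specific revision might look like the following; only the `model_revision` key and the `jailbreak@47ffb2e` value are taken from the text above, and the rest of the dictionary mirrors the earlier sketch.

```python
# Pin the jailbreak rule to a specific model revision (sketch; the dictionary
# layout follows the earlier example, only model_revision is documented here).
pinned_rule = {
    "type": "jailbreak",
    "expected": "fail",
    "threshold": 0.9,
    "model_revision": "jailbreak@47ffb2e",  # omit this key to use the default revision
}
```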
Supported models
| Rule Type | Latest Model Revision | Language Support | Task Description | Metric and Model Card Report |
|---|---|---|---|---|
| Jailbreak | jailbreak@47ffb2e | en, pt, es | Detects attempts to bypass AI safety measures | JailbreakGuard-10K |
Revisions
| Model | Revision | Benchmark |
|---|---|---|
| Jailbreak | jailbreak@47ffb2e (latest) | f1: 0.95@JailbreakGuard-10K |
Model metrics
Jailbreak@47ffb2e
| Metric | Score | Dataset |
|---|---|---|
| Accuracy | 0.95 | JailbreakGuard-10K |
| Precision | 0.93 | JailbreakGuard-10K |
| Recall | 0.97 | JailbreakGuard-10K |
| F1 Score | 0.95 | JailbreakGuard-10K |
| AUC-ROC | 0.98 | JailbreakGuard-10K |
| False Positive Rate | 0.03 | JailbreakGuard-10K |
| True Negative Rate | 0.97 | JailbreakGuard-10K |
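As a quick consistency check using the standard definitions (nothing model-specific), the reported F1 score follows from the precision and recall above, and the true negative rate is the complement of the false positive rate:

```python
# Sanity-check the reported scores using the standard definitions.
precision, recall = 0.93, 0.97
false_positive_rate = 0.03

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))                       # 0.95, matching the F1 Score row
print(round(1 - false_positive_rate, 2))  # 0.97, matching the True Negative Rate row
```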