Jailbreak rule for guardrails
The `jailbreak` rule is used to detect and prevent attempts to bypass or manipulate the LLM's intended behavior or restrictions. This rule helps maintain the integrity of the AI system by identifying potential exploits or malicious prompts that could lead to undesired or harmful responses.

To use it, configure the rule with the following fields (see the sketch after this list):

- `jailbreak` specifies the rule type.
- `fail` flags the request when a jailbreak attempt is detected.
- `model_revision` pins the model revision used by the rule, e.g., `jailbreak@47ffb2e`. If not specified, the default revision for the model is used.

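The exact configuration syntax depends on how your guardrails are defined; the sketch below is a minimal illustration of the fields described above, expressed as a Python dictionary. Only the `jailbreak` rule type, the `fail` action, and `model_revision` come from this page; the surrounding `guardrails`/`rules` structure is an assumption for illustration.

```python
import json

# Hypothetical rule definition mirroring the fields described above.
# The "guardrails"/"rules" wrapper is illustrative; only "jailbreak",
# "fail", and "model_revision" are taken from this page.
rule_config = {
    "guardrails": {
        "rules": [
            {
                "type": "jailbreak",                    # specifies the rule type
                "action": "fail",                       # flag when a jailbreak attempt is detected
                "model_revision": "jailbreak@47ffb2e",  # optional; omit to use the default revision
            }
        ]
    }
}

print(json.dumps(rule_config, indent=2))
```
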
| Rule Type | Latest Model Revision | Language support | Task Description | Metric and model card report |
|---|---|---|---|---|
| Jailbreak | `jailbreak@47ffb2e` | en, pt, es | Detects attempts to bypass AI safety measures | JailbreakGuard-10K |

| Model | Revision | Benchmark |
|---|---|---|
| Jailbreak | `jailbreak@47ffb2e` (latest) | f1: 0.95 @ JailbreakGuard-10K |

| Metric | Score | Dataset |
|---|---|---|
| Accuracy | 0.95 | JailbreakGuard-10K |
| Precision | 0.93 | JailbreakGuard-10K |
| Recall | 0.97 | JailbreakGuard-10K |
| F1 Score | 0.95 | JailbreakGuard-10K |
| AUC-ROC | 0.98 | JailbreakGuard-10K |
| False Positive Rate | 0.03 | JailbreakGuard-10K |
| True Negative Rate | 0.97 | JailbreakGuard-10K |

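As a quick sanity check, the reported F1 score is the harmonic mean of the reported precision and recall, and the true negative rate is the complement of the false positive rate. The snippet below only reproduces that arithmetic with the values from the table; it does not query the model or the benchmark.

```python
# Values copied from the metrics table above.
precision, recall = 0.93, 0.97
fpr = 0.03

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))       # 0.95, matching the reported F1 Score

# The true negative rate is 1 minus the false positive rate.
print(round(1 - fpr, 2))  # 0.97, matching the reported True Negative Rate
```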