# Rules catalog
## Model-based rules

### Model-based revisions for guardrails
Model-based revisions let you pin a specific revision of a model for a particular guardrail or policy. You can specify the revision by passing `model_revision` in the rule, e.g., `jailbreak@47ffb2e`. If not specified, the default revision for the model is used.
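As a concrete illustration, a rule definition that pins a revision might look like the minimal sketch below. The `Rule` class and its fields are hypothetical stand-ins; only the `model_revision` name, the `<model>@<revision>` format, and the default-fallback behavior come from this page.

```python
# Hypothetical sketch of pinning a model revision in a rule definition.
# Only `model_revision`, the "<model>@<revision>" format, and the
# fallback-to-default behavior come from this page; the Rule class
# itself is an assumed stand-in, not a real guardrails API.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Minimal stand-in for a guardrail rule definition."""
    rule_type: str                        # e.g. "jailbreak"
    model_revision: Optional[str] = None  # e.g. "jailbreak@47ffb2e"

    def resolve_revision(self, default: str) -> str:
        # If no revision is pinned, fall back to the model's default revision.
        return self.model_revision or default


# Pin a specific revision explicitly:
pinned = Rule(rule_type="jailbreak", model_revision="jailbreak@47ffb2e")

# Omit model_revision to use the model's default revision:
unpinned = Rule(rule_type="jailbreak")

print(pinned.resolve_revision(default="jailbreak@47ffb2e"))    # jailbreak@47ffb2e
print(unpinned.resolve_revision(default="jailbreak@47ffb2e"))  # jailbreak@47ffb2e
```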
### Supported models
Rule Type | Latest Model Revision | Language Support | Task Description | Benchmark |
---|---|---|---|---|
Jailbreak | `jailbreak@47ffb2e` | en, pt, es | Detects attempts to bypass AI safety measures | JailbreakGuard-10K |
### Revisions
Model | Revision | Benchmark |
---|---|---|
Jailbreak | `jailbreak@47ffb2e` (latest) | F1: 0.95 (JailbreakGuard-10K) |
### Model metrics

#### jailbreak@47ffb2e
Metric | Score | Dataset |
---|---|---|
Accuracy | 0.95 | JailbreakGuard-10K |
Precision | 0.93 | JailbreakGuard-10K |
Recall | 0.97 | JailbreakGuard-10K |
F1 Score | 0.95 | JailbreakGuard-10K |
AUC-ROC | 0.98 | JailbreakGuard-10K |
False Positive Rate | 0.03 | JailbreakGuard-10K |
True Negative Rate | 0.97 | JailbreakGuard-10K |
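As a quick sanity check on these numbers, F1 is the harmonic mean of precision and recall, and the true negative rate is the complement of the false positive rate. The short snippet below (generic arithmetic, not part of any guardrails API) verifies both against the table:

```python
# Sanity check: F1 is the harmonic mean of precision and recall.
precision = 0.93
recall = 0.97

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.95, matching the reported F1 score

# The true negative rate is the complement of the false positive rate:
# TNR = 1 - FPR.
fpr = 0.03
print(1 - fpr)  # 0.97, matching the reported true negative rate
```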
The jailbreak model was benchmarked on the JailbreakGuard-10K dataset, which consists of 10,000 curated examples of jailbreak attempts and safe interactions in English, Portuguese, and Spanish. The dataset spans a wide range of advanced jailbreak techniques, so the reported scores reflect performance across varied scenarios rather than a single attack style.