Model-based revisions for guardrails
To pin a rule to a specific model revision, set `model_revision` in the rule, e.g., `jailbreak@47ffb2e`. If it is not specified, the rule uses the default revision for the model.
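
As an illustration of the `name@revision` format and the default fallback described above, here is a minimal sketch. The helper names (`ModelRef`, `parse_model_revision`, `resolve_revision`), the rule dictionaries, and the `defaults` mapping are hypothetical and not part of any documented API; only the `model_revision` key and the value `jailbreak@47ffb2e` come from the text.

```python
from typing import Dict, NamedTuple, Optional


class ModelRef(NamedTuple):
    name: str
    revision: Optional[str]  # None means "no revision pinned"


def parse_model_revision(value: str) -> ModelRef:
    """Split 'name@revision' into its parts; a bare 'name' has no pinned revision."""
    name, sep, revision = value.partition("@")
    if not name or (sep and not revision):
        raise ValueError(f"malformed model revision: {value!r}")
    return ModelRef(name, revision or None)


def resolve_revision(rule: Dict[str, str], defaults: Dict[str, str]) -> str:
    """Return the pinned revision if the rule sets one, else the default for its type.

    `defaults` is a hypothetical mapping from rule type to default revision string.
    """
    pinned = rule.get("model_revision")
    if pinned is not None:
        parse_model_revision(pinned)  # validate the format before using it
        return pinned
    return defaults[rule["type"]]


# Hypothetical rule shapes, used only to exercise the helpers above.
defaults = {"jailbreak": "jailbreak@47ffb2e"}
pinned_rule = {"type": "jailbreak", "model_revision": "jailbreak@47ffb2e"}
unpinned_rule = {"type": "jailbreak"}

print(parse_model_revision(pinned_rule["model_revision"]))
print(resolve_revision(unpinned_rule, defaults))
```

Pinning a revision keeps a rule's behavior stable when the default revision changes; leaving it unset lets the rule follow the default.
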
Rule Type | Latest Model Revision | Language support | Task Description | Metric and model card report |
---|---|---|---|---|
Jailbreak | jailbreak@47ffb2e | en, pt, es | Detects attempts to bypass AI safety measures | JailbreakGuard-10K |

Model | Revision | Benchmark |
---|---|---|
Jailbreak | jailbreak@47ffb2e (latest) | f1: 0.95@JailbreakGuard-10K |

Metric | Score | Dataset |
---|---|---|
Accuracy | 0.95 | JailbreakGuard-10K |
Precision | 0.93 | JailbreakGuard-10K |
Recall | 0.97 | JailbreakGuard-10K |
F1 Score | 0.95 | JailbreakGuard-10K |
AUC-ROC | 0.98 | JailbreakGuard-10K |
False Positive Rate | 0.03 | JailbreakGuard-10K |
True Negative Rate | 0.97 | JailbreakGuard-10K |
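
The reported scores are related to one another: F1 is the harmonic mean of precision and recall, and the true negative rate is one minus the false positive rate. The short sketch below checks that the table's values are mutually consistent under those standard definitions. It uses only the numbers from the table and says nothing about how the scores were originally computed.

```python
# Values copied from the metrics table for jailbreak@47ffb2e on JailbreakGuard-10K.
precision = 0.93
recall = 0.97
false_positive_rate = 0.03


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision, recall)
true_negative_rate = 1.0 - false_positive_rate

print(round(f1, 2))                  # 0.95, matches the F1 Score row
print(round(true_negative_rate, 2))  # 0.97, matches the True Negative Rate row
```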