The jailbreak rule detects and prevents attempts to bypass or manipulate the LLM's intended behavior or restrictions. This rule helps maintain the integrity of the AI system by identifying potential exploits or malicious prompts that could lead to undesired or harmful responses.

Parameters:

  • type: jailbreak (specifies the rule type)
  • expected: Typically set to fail, so the rule flags the interaction when a jailbreak attempt is detected
  • threshold: A number between 0 and 1 representing the confidence level for jailbreak detection (e.g., 0.9 for 90% confidence)
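Putting these parameters together, here is a minimal sketch of how such a rule might be represented and evaluated. The exact policy schema and client API are not shown in this document, so the plain dictionary and the evaluate-style helper below are illustrative assumptions, not the actual interface.

```python
# Hypothetical sketch of a jailbreak rule definition; the real policy schema
# and client API may differ from what is shown here.

jailbreak_rule = {
    "type": "jailbreak",  # rule type described above
    "expected": "fail",   # flag the interaction when a jailbreak attempt is detected
    "threshold": 0.9,     # only flag detections with at least 90% confidence
}

def rule_triggers(rule: dict, detection_confidence: float) -> bool:
    """Illustrative check: the rule fires when the detector's jailbreak
    confidence meets or exceeds the configured threshold."""
    return rule["expected"] == "fail" and detection_confidence >= rule["threshold"]

# Example: a prompt scored at 0.94 confidence would be flagged.
print(rule_triggers(jailbreak_rule, 0.94))  # True
```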

For a practical example of how to implement this rule in a policy, see our Prevent Jailbreak Attempts policy example.

Model revisions

You can specify the model revision by passing the model_revision in the rule, e.g., jailbreak@47ffb2e.

If not specified, it will use the default revision for the model.
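As a sketch of the two options above, the snippet below contrasts pinning a revision with relying on the default. Whether the revision is embedded in the type string (as in jailbreak@47ffb2e) or passed as a separate field depends on the actual policy schema, so treat the structure here as an assumption.

```python
# Hypothetical sketch: pinning the jailbreak model revision vs. using the default.

pinned_rule = {
    "type": "jailbreak@47ffb2e",  # pinned to revision 47ffb2e
    "expected": "fail",
    "threshold": 0.9,
}

default_rule = {
    "type": "jailbreak",  # no revision given: the default model revision is used
    "expected": "fail",
    "threshold": 0.9,
}
```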

Supported models

| Rule Type | Latest Model Revision | Language support | Task Description | Metric and model card report |
| --- | --- | --- | --- | --- |
| Jailbreak | jailbreak@47ffb2e | en, pt, es | Detects attempts to bypass AI safety measures | JailbreakGuard-10K |

Revisions

| Model | Revision | Benchmark |
| --- | --- | --- |
| Jailbreak | jailbreak@47ffb2e (latest) | f1: 0.95 @ JailbreakGuard-10K |

Model metrics

Jailbreak@47ffb2e

| Metric | Score | Dataset |
| --- | --- | --- |
| Accuracy | 0.95 | JailbreakGuard-10K |
| Precision | 0.93 | JailbreakGuard-10K |
| Recall | 0.97 | JailbreakGuard-10K |
| F1 Score | 0.95 | JailbreakGuard-10K |
| AUC-ROC | 0.98 | JailbreakGuard-10K |
| False Positive Rate | 0.03 | JailbreakGuard-10K |
| True Negative Rate | 0.97 | JailbreakGuard-10K |

The jailbreak model was benchmarked using the JailbreakGuard-10K dataset, which consists of 10,000 carefully curated examples of potential jailbreak attempts and safe interactions in English, Portuguese, and Spanish. This dataset covers a wide range of advanced jailbreak techniques and ensures robust performance across various scenarios.
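As a quick sanity check on how the reported figures relate, F1 is the harmonic mean of precision and recall, and the true negative rate is the complement of the false positive rate. The short snippet below reproduces the reported values from the table.

```python
# Reproduce the reported F1 Score and True Negative Rate from the other metrics.
precision, recall = 0.93, 0.97
false_positive_rate = 0.03

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
true_negative_rate = 1 - false_positive_rate

print(round(f1, 2))                  # 0.95, matching the reported F1 Score
print(round(true_negative_rate, 2))  # 0.97, matching the reported True Negative Rate
```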