To detect prompt injections, use the jailbreak rule. For more details, see our Rules Catalog.

Rule structure:

  • type: jailbreak
  • expected: fail
  • threshold: 0.8

Create the policy

Here is an example of a policy that detects injection:


policy = {
    "id": "detect-injection",
    "definition": "...",
    "rules": [
        {
            "type": "jailbreak",
            "expected": "fail",
            "threshold": 0.8
        }
    ],
    "target": "input"
}
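The policy above can be submitted programmatically. The sketch below is a minimal example, assuming a requests-style HTTP client and a hypothetical base URL and /policies path; adjust both to match your deployment's application endpoint:

```python
# Hypothetical base URL -- replace with your deployment's value.
BASE_URL = "https://api.example.com"


def build_policy() -> dict:
    """Build the detect-injection policy payload shown above."""
    return {
        "id": "detect-injection",
        "definition": "...",
        "rules": [
            {
                "type": "jailbreak",
                "expected": "fail",
                "threshold": 0.8,
            }
        ],
        "target": "input",
    }


def create_policy(session, base_url: str = BASE_URL) -> dict:
    """POST the policy to the (hypothetical) application endpoint.

    `session` is any object with a requests-style .post() method,
    e.g. requests.Session().
    """
    resp = session.post(f"{base_url}/policies", json=build_policy())
    resp.raise_for_status()
    return resp.json()
```

The payload is built by a separate function so it can be inspected or validated before it is sent over the wire.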

Next steps

  • Create the policy using the application endpoint
  • Call the Evaluate API with the messages and the policy id detect-injection
  • The API returns a status of pass or fail along with a list of policy violations; check whether the policy id detect-injection appears in that list
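The steps above can be sketched end to end. The endpoint path, response shape, and helper names below are assumptions for illustration, not the documented API; the violation check itself is plain Python:

```python
def call_evaluate(session, messages, policy_id="detect-injection",
                  base_url="https://api.example.com"):
    """Call the (hypothetical) Evaluate API with messages and a policy id."""
    resp = session.post(f"{base_url}/evaluate",
                        json={"messages": messages, "policy_id": policy_id})
    resp.raise_for_status()
    return resp.json()


def is_injection_flagged(result: dict, policy_id: str = "detect-injection") -> bool:
    """Check whether the evaluation failed and the given policy was violated.

    Assumes the response carries a "status" field ("pass" or "fail") and a
    "violations" list whose entries include a "policy_id", as described above.
    """
    if result.get("status") != "fail":
        return False
    return any(v.get("policy_id") == policy_id
               for v in result.get("violations", []))


# Illustrative response shape only -- the real field names may differ.
sample = {
    "status": "fail",
    "violations": [{"policy_id": "detect-injection", "rule": "jailbreak"}],
}
print(is_injection_flagged(sample))  # -> True
```

Keeping the violation check in its own helper lets you unit-test the pass/fail logic without making any network calls.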