To implement a classifier for topics and names, you can use the classifier rule. This rule helps prevent the AI from engaging with topics, names, or phrases that are deemed inappropriate or sensitive. For more details, see our Rules Catalog.

Rule structure:

  • type: classifier
  • value: Comma-separated list of topics to classify, e.g., "hate speech, harassment, sexual content, self-harm"
  • expected: fail (to flag the request when a blocked topic is detected)
  • threshold: Minimum confidence level for detection, e.g., 0.9 for 90% confidence (see the evaluation sketch after this list)
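
To make these fields concrete, here is a minimal sketch of how a classifier rule might combine them at evaluation time. The classifier interface is an assumption for illustration only: we assume it returns a confidence score per topic, and the function and variable names are hypothetical, not the product's actual API.

# Hypothetical evaluation sketch; not the product's actual implementation.
def evaluate_classifier_rule(rule: dict, topic_scores: dict[str, float]) -> bool:
    """Return True if the rule flags the request."""
    topics = [t.strip() for t in rule["value"].split(",")]
    # A topic counts as detected when its confidence meets the threshold.
    detected = any(topic_scores.get(t, 0.0) >= rule["threshold"] for t in topics)
    # expected == "fail" means a detection should flag the request.
    return detected if rule["expected"] == "fail" else not detected

# Hypothetical per-topic scores for one input:
scores = {"hate speech": 0.02, "harassment": 0.95,
          "sexual content": 0.01, "self-harm": 0.00}
rule = {
    "type": "classifier",
    "value": "hate speech, harassment, sexual content, self-harm",
    "expected": "fail",
    "threshold": 0.9,
}
print(evaluate_classifier_rule(rule, scores))  # True: harassment scores 0.95 >= 0.9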

Create the policy

Here’s an example of a policy that implements a classifier rule:

{
  "id": "unique policy id",
  "definition": "short description",
  "rules": [
    {
      "type": "classifier",
      "value": "hate speech, harassment, sexual content, self-harm",
      "expected": "fail",
      "threshold": 0.9
    }
  ],
  "target": "input"
}
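
Before submitting a policy, a quick sanity check of the rule fields can catch typos. The sketch below is illustrative only: the constraints it enforces (for example, that threshold must lie between 0 and 1) are assumptions inferred from the example above, not a documented schema.

import json

def validate_classifier_policy(policy: dict) -> list[str]:
    """Return a list of problems; an empty list means the policy looks well-formed."""
    problems = []
    for field in ("id", "definition", "rules", "target"):
        if field not in policy:
            problems.append(f"missing top-level field: {field}")
    for i, rule in enumerate(policy.get("rules", [])):
        if rule.get("type") != "classifier":
            problems.append(f"rule {i}: type should be 'classifier'")
        if not str(rule.get("value", "")).strip():
            problems.append(f"rule {i}: value must list at least one topic")
        t = rule.get("threshold")
        # Assumed constraint: threshold is a number in [0, 1].
        if not isinstance(t, (int, float)) or not 0.0 <= t <= 1.0:
            problems.append(f"rule {i}: threshold must be a number between 0 and 1")
    return problems

# The example policy from above, inlined so the sketch is self-contained.
policy = json.loads("""
{
  "id": "unique policy id",
  "definition": "short description",
  "rules": [
    {
      "type": "classifier",
      "value": "hate speech, harassment, sexual content, self-harm",
      "expected": "fail",
      "threshold": 0.9
    }
  ],
  "target": "input"
}
""")
print(validate_classifier_policy(policy) or "policy looks well-formed")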