classifier
The classifier rule helps prevent the AI from responding with content on topics that are deemed inappropriate or sensitive. For more details, see our Rules Catalog.
Rule structure:
- type: classifier
- value: List of topics to classify, e.g., "hate speech, harassment, sexual content, self-harm"
- expected: fail (to flag when a blocked topic is detected)
- threshold: Confidence level for detection (e.g., 0.9 for 90% confidence)
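
A minimal sketch of how a classifier rule entry and its threshold check might fit together, assuming a Python dict representation. The field names (type, value, expected, threshold) come from the structure above; the dict layout, the `is_flagged` helper, and the comma-separated topic parsing are illustrative assumptions, not the actual configuration format or API.

```python
# Illustrative sketch only: field names follow the rule structure above;
# the dict layout and helper function are assumptions for demonstration.
classifier_rule = {
    "type": "classifier",
    "value": "hate speech, harassment, sexual content, self-harm",
    "expected": "fail",
    "threshold": 0.9,  # flag only when the classifier is at least 90% confident
}

def is_flagged(rule: dict, detected_topic: str, confidence: float) -> bool:
    """Hypothetical helper showing how the topic list and threshold interact."""
    blocked_topics = [t.strip() for t in rule["value"].split(",")]
    return detected_topic in blocked_topics and confidence >= rule["threshold"]

# Example: a response classified as "harassment" at 0.95 confidence is flagged,
# while the same topic at 0.5 confidence falls below the 0.9 threshold.
print(is_flagged(classifier_rule, "harassment", 0.95))  # True
print(is_flagged(classifier_rule, "harassment", 0.5))   # False
```

With expected set to fail, a detection above the threshold marks the check as failed, which is how a blocked topic surfaces in the results.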