Guard language model
Understanding the evaluation and auto correction processes
Our language model is the core component designed to ensure that LLM outputs align with your specific policies, preferences, and security standards. It operates in two key steps: Evaluation and Auto correction.
Understanding the evaluation
During the evaluation step, the model takes three main inputs:
- User input and LLM output: The pair of user input and LLM output to be evaluated.
- Application policies: A set of predefined rules and guidelines that the content must adhere to.
- Threshold: A configurable parameter that controls how strict the evaluation is.
{
  "messages": [
    {
      "role": "system", // optional
      "content": "Example system instructions / prompt"
    },
    {
      "role": "user",
      "content": "Example user input"
    },
    {
      "role": "assistant",
      "content": "Example LLM output"
    }
  ],
  "application": "your_application_id",
  "threshold": 0.3,
  "correction_enabled": true,
  "override_policy": [ // your custom policies here
    {
      "id": "custom_policy_1",
      "rule": "Should not share sensitive payment information, like credit card number, CVV and expiration date.",
      "examples": [
        {
          "type": "FAIL",
          "output_example": "This is the card number 1234-1234-1021-2201" // an example LLM output that should fail
        },
        {
          "type": "FAIL",
          "output_example": "Sharing a credit card number, CVV, expiration date and any other detail from a credit card" // or a description of what should fail
        }
      ]
    }
  ]
}
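As a rough illustration, a request like the one above can be sent from application code. The sketch below is a minimal example only: the endpoint URL, the bearer-token header, and the use of the requests library are assumptions rather than part of the documented API, so adapt them to your deployment.

import requests

# Build the same request body shown above (trimmed to one custom policy example).
payload = {
    "messages": [
        {"role": "user", "content": "Example user input"},
        {"role": "assistant", "content": "Example LLM output"},
    ],
    "application": "your_application_id",
    "threshold": 0.3,
    "correction_enabled": True,
    "override_policy": [
        {
            "id": "custom_policy_1",
            "rule": "Should not share sensitive payment information, like credit card number, CVV and expiration date.",
            "examples": [
                {"type": "FAIL", "output_example": "This is the card number 1234-1234-1021-2201"}
            ],
        }
    ],
}

# The endpoint URL and auth scheme below are placeholders; substitute your own.
response = requests.post(
    "https://api.example.com/v1/evaluations",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
    timeout=30,
)
result = response.json()
print(result["evaluation"]["status"], result["evaluation"]["policy_violations"])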
The model evaluates the messages and returns a risk score for the input (input_score) and the output (output_score), along with a list of any violated policies (policy_violations). The risk score indicates how strongly the content conflicts with the policies. By adjusting the threshold, you can control the sensitivity of this evaluation: lowering the threshold enforces stricter adherence to the policies, while raising it allows for more leniency.
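To make the threshold semantics concrete, here is a small sketch. It assumes, consistent with the example response below, that a score above the threshold counts as a violation; the function name is illustrative.

def exceeds_threshold(evaluation: dict, threshold: float) -> bool:
    # Flag the exchange when the overall input/output scores or any
    # individual policy score rises above the configured threshold.
    scores = [evaluation["input_score"], evaluation["output_score"]]
    scores += [violation["score"] for violation in evaluation["policy_violations"]]
    return any(score > threshold for score in scores)

# With threshold = 0.3, the example scores of 0.4 and 0.6 are flagged;
# raising the threshold to 0.7 would let the same content pass.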
Setting up policies
Learn more about how to set up policies
Auto correction
Following the evaluation step, auto correction automatically reviews the LLM output to ensure it meets the application’s policies. It examines the specific policies listed in policy_violations and edits the content to bring it into compliance, returning the result in correction.
{
  "id": "evaluation_id_12345",
  "object": "evaluation",
  "created": "2024-08-14T12:34:56Z",
  "evaluation": {
    "status": "FAIL",
    "input_score": 0.4,
    "output_score": 0.6,
    "policy_violations": [
      {
        "policy_id": "pii_policy_01",
        "score": 0.50
      },
      {
        "policy_id": "intent_policy_02",
        "score": 0.65
      }
    ]
  },
  "correction": {
    "choices": [
      {
        "role": "guard",
        "content": "I'm sorry, but I can't share sensitive information. How else can I assist you?"
      }
    ]
  }
}
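A minimal sketch of how an application might consume this response is shown below: when the evaluation fails and a correction is available, the guard's corrected message is returned in place of the original LLM output. The helper function name is illustrative; the field names come from the example above.

def choose_final_output(guard_response: dict, original_output: str) -> str:
    # Fall back to the guard's correction when the evaluation failed and a
    # corrected message is available; otherwise keep the original output.
    evaluation = guard_response["evaluation"]
    if evaluation["status"] == "FAIL":
        choices = guard_response.get("correction", {}).get("choices", [])
        if choices:
            return choices[0]["content"]
    return original_output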