Guard language model
Understanding the evaluation and auto correction processes
Our language model is the core component designed to ensure that LLM outputs align with your specific policies, preferences, and security standards. It operates in two key steps: Evaluation and Auto correction.
Understanding the evaluation
During the evaluation step, the model takes three main inputs:
- User input and LLM output: The pair of user input and LLM output to be evaluated.
- Application policies: A set of predefined rules and guidelines that the content must adhere to.
- Threshold: A configurable parameter that controls how strict the evaluation is.
{
  "messages": [
    {
      "role": "system", // optional
      "content": "Example system instructions / prompt"
    },
    {
      "role": "user",
      "content": "Example user input"
    },
    {
      "role": "assistant",
      "content": "Example LLM output"
    }
  ],
  "application": "your_application_id",
  "threshold": 0.3,
  "correction_enabled": true,
  "override_policy": [ // your custom policies here
    {
      "id": "custom_policy_1",
      "rule": "Should not share sensitive payment information, like credit card number, CVV and expiration date.",
      "examples": [
        {
          "type": "FAIL",
          "output_example": "This is the card number 1234-1234-1021-2201" // an example LLM output that should fail
        },
        {
          "type": "FAIL",
          "output_example": "Sharing a credit card number, CVV, expiration date and any other detail from a credit card" // or a description of what should fail
        }
      ]
    }
  ]
}
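As a rough illustration, a request like the one above can be sent from application code. The sketch below is a minimal example only: the endpoint URL, the bearer-token header, and the use of the requests library are assumptions rather than part of the documented API, so adapt them to your deployment.

import requests

# Build the same request body shown above (trimmed to one custom policy example).
payload = {
    "messages": [
        {"role": "user", "content": "Example user input"},
        {"role": "assistant", "content": "Example LLM output"},
    ],
    "application": "your_application_id",
    "threshold": 0.3,
    "correction_enabled": True,
    "override_policy": [
        {
            "id": "custom_policy_1",
            "rule": "Should not share sensitive payment information, like credit card number, CVV and expiration date.",
            "examples": [
                {"type": "FAIL", "output_example": "This is the card number 1234-1234-1021-2201"}
            ],
        }
    ],
}

# The endpoint URL and auth scheme below are placeholders; substitute your own.
response = requests.post(
    "https://api.example.com/v1/evaluations",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
    timeout=30,
)
result = response.json()
print(result["evaluation"]["status"], result["evaluation"]["policy_violations"])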
The model evaluates the messages and returns a risk score for the input (input_score) and the output (output_score), along with a list of any violated policies (policy_violations). The risk score indicates how strongly the content conflicts with the policies. By adjusting the threshold, you can control the sensitivity of this evaluation: lowering the threshold enforces stricter adherence to the policies, while raising it allows for more leniency.
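To make the threshold semantics concrete, here is a small sketch. It assumes, consistent with the example response below, that a score above the threshold counts as a violation; the function name is illustrative.

def exceeds_threshold(evaluation: dict, threshold: float) -> bool:
    # Flag the exchange when the overall input/output scores or any
    # individual policy score rises above the configured threshold.
    scores = [evaluation["input_score"], evaluation["output_score"]]
    scores += [violation["score"] for violation in evaluation["policy_violations"]]
    return any(score > threshold for score in scores)

# With threshold = 0.3, the example scores of 0.4 and 0.6 are flagged;
# raising the threshold to 0.7 would let the same content pass.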
Setting up policies
Learn more about how to set up policies
Auto correction
Following the evaluation step, auto correction automatically reviews the LLM output to ensure it meets the application’s policies. It examines the specific policies listed in policy_violations and edits the content to bring it into compliance, returning the result in correction.
{
  "id": "evaluation_id_12345",
  "object": "evaluation",
  "created": "2024-08-14T12:34:56Z",
  "evaluation": {
    "status": "FAIL",
    "input_score": 0.4,
    "output_score": 0.6,
    "policy_violations": [
      {
        "policy_id": "pii_policy_01",
        "score": 0.50
      },
      {
        "policy_id": "intent_policy_02",
        "score": 0.65
      }
    ]
  },
  "correction": {
    "choices": [
      {
        "role": "guard",
        "content": "I'm sorry, but I can't share sensitive information. How else can I assist you?"
      }
    ]
  }
}
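A minimal sketch of how an application might consume this response is shown below: when the evaluation fails and a correction is available, the guard's corrected message is returned in place of the original LLM output. The helper function name is illustrative; the field names come from the example above.

def choose_final_output(guard_response: dict, original_output: str) -> str:
    # Fall back to the guard's correction when the evaluation failed and a
    # corrected message is available; otherwise keep the original output.
    evaluation = guard_response["evaluation"]
    if evaluation["status"] == "FAIL":
        choices = guard_response.get("correction", {}).get("choices", [])
        if choices:
            return choices[0]["content"]
    return original_output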