Feedback, Model Retraining, and Benchmarking

This is in Public Beta for Enterprise plans.

Feedback and Counterfactual Example Generation

When users submit feedback through the "Teach Your AI" feature, that feedback is processed using a specialized reasoning large language model. This model creates hypothetical examples based on your feedback, a method known as counterfactual generation. Counterfactual generation involves imagining "what-if" scenarios by altering small parts of a given example to test and improve the model's understanding.

Importantly, your exact feedback is not directly used to train your model. Instead, the AI generates new hypothetical examples inspired by your input. Both your original feedback and the hypothetical examples are kept private and are not shared with Protege's general ShieldLlama model.
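The idea behind counterfactual generation can be illustrated with a minimal sketch. The function name, the template-swapping approach, and the example strings below are all hypothetical, simplified stand-ins for what the reasoning model actually does:

```python
# Toy sketch of counterfactual ("what-if") generation: produce hypothetical
# variants of a feedback example by altering one small part, so the model can
# learn the general rule rather than memorizing the exact feedback.
# `generate_counterfactuals` is a hypothetical helper, not Protege's API.
def generate_counterfactuals(example: str, slot: str, alternatives: list[str]) -> list[str]:
    """Return variants of `example` with `slot` replaced by each alternative."""
    return [example.replace(slot, alt) for alt in alternatives]

feedback = "The claim 'clinically proven' requires substantiation."
variants = generate_counterfactuals(
    feedback,
    "clinically proven",
    ["doctor recommended", "independently tested"],
)
# Each variant is a new hypothetical example inspired by, but distinct from,
# the original feedback.
```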

Model Retraining Process

  • Models are retrained nightly at midnight Pacific Time on business days.

  • If a newly trained model exceeds the benchmark set by the prior model, it will be automatically promoted to active use.
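The promotion rule above amounts to a simple comparison. This sketch assumes a single scalar benchmark score per model; the function name and signature are illustrative only:

```python
# Hypothetical sketch of the nightly promotion decision: a newly trained model
# replaces the active one only if it beats the prior model's benchmark score.
def should_promote(new_score: float, prior_score: float) -> bool:
    """Promote only on a strict improvement over the prior benchmark."""
    return new_score > prior_score

# A tie or a regression leaves the existing model in place.
```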

Benchmarking and Performance Measurement

To ensure quality improvements, we use multiple types of benchmarks:

1. Policy Annotation Set

These are real-world examples where users have labeled AI feedback as incorrect. The model is tested against these annotations to verify that it handles such cases more accurately over time.

2. Test Set

A portion (10%) of the synthetically generated dataset, based on user examples, is held out of the training process. This "test set" is used to objectively measure model performance on unseen data.
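The 10% holdout works like a standard train/test split. The sketch below is a generic illustration with a hypothetical helper, not Protege's actual data pipeline:

```python
import random

# Illustrative train/test split: hold out a fraction of the synthetically
# generated examples so performance is measured on data the model never saw.
def train_test_split(examples: list, test_fraction: float = 0.10, seed: int = 42):
    """Return (train, test) with `test_fraction` of examples held out."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

With 100 examples and the default 10% fraction, 90 go to training and 10 are reserved for evaluation.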

3. Confusion Matrix

We also analyze model results using a confusion matrix, which breaks down predictions into four categories:

  • True Positive (Green Box): Correctly identified errors.

  • True Negative (Green Box): Correctly ignored non-errors.

  • False Positive (Red Box): Incorrectly flagged something as an error.

  • False Negative (Red Box): Missed identifying a true error.

The goal is to maximize values in the green boxes (true positives and true negatives) while minimizing values in the red boxes (false positives and false negatives).
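The four categories above can be tallied from parallel lists of predictions and ground-truth labels. This is a generic confusion-matrix computation, not Protege's internal implementation:

```python
# Tally the four confusion-matrix cells from parallel prediction/label lists.
# `predicted[i]` is True if the model flagged item i as an error;
# `actual[i]` is True if item i really was an error.
def confusion_counts(predicted: list[bool], actual: list[bool]) -> dict[str, int]:
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for pred, act in zip(predicted, actual):
        if pred and act:
            counts["TP"] += 1   # correctly identified error (green box)
        elif not pred and not act:
            counts["TN"] += 1   # correctly ignored non-error (green box)
        elif pred and not act:
            counts["FP"] += 1   # incorrectly flagged a non-error (red box)
        else:
            counts["FN"] += 1   # missed a true error (red box)
    return counts
```

A better model shifts mass out of the FP and FN cells and into TP and TN.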

Model Versions and Update Numbering

Minor Version Updates

Version updates that are specific to your environment are reflected through minor number increments. For example, moving from version 10.0.0 to 10.0.1, the last digit increases with each minor update that comes from a Teach Your AI fix.
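The increment described above can be sketched as a simple string operation on a three-part version number. The helper name is hypothetical:

```python
# Illustrative patch bump for a Teach Your AI fix: only the last component
# of a "major.minor.patch" version string increases.
def bump_patch(version: str) -> str:
    major, minor, patch = version.split(".")
    return f"{major}.{minor}.{int(patch) + 1}"
```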

Major Version Upgrades (Model Rebasing)

On a regular cadence, Protege performs model rebasing, updating your model with the latest ShieldLlama model, which incorporates policies, precedents, and compliance standards observed across industries (e.g., FTC, NAD, FDIC, and more). These substantial updates are reflected in major version upgrades (e.g., moving from version 10.0 to 11.0, incrementing the first digit).
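By contrast, a rebase bumps the first component and resets the rest. Again, this helper is an illustrative sketch, not Protege's versioning code:

```python
# Illustrative major bump for a model rebase onto the latest ShieldLlama base:
# the first component increases and the environment-specific counters reset.
def rebase_version(version: str) -> str:
    major = int(version.split(".")[0])
    return f"{major + 1}.0.0"
```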

To set up a fine-tuned model for your organization, reach out to [email protected].
