Accelerate AI Product Development with Freeplay's New Eval Creation & Alignment Tools

Aug 7, 2024

Evals function like unit tests for generative AI features, and they need to be just as easy to create any time you discover a new failure state. Today we’re releasing our new Eval Alignment workflow in Freeplay that makes it radically easier and faster for anyone on your team to add trustworthy, custom model-graded evals to your product.

It can now take less than 10 minutes to create, validate and launch a new model-graded eval that’s specific to your application.

You can then automatically enable it both for live monitoring of your production data and for scoring offline tests & experiments – all without touching code. We’re also making it much easier to tune and improve eval quality over time so you can continue to trust your evals as your product evolves.

We’ve been working on this problem for a while. Since we first announced tools to help define auto-evals in Freeplay over a year ago, we’ve given teams a UI where they can confirm or correct model-graded eval values whenever they see them, passively building a ground truth dataset for evals in the process. In May this year we launched a beta alignment tool and got lots of helpful feedback from customers.

This new version reflects that feedback, and we already know it’s working. In multiple cases we’ve seen people go from 60% or 70% alignment to 100% in 2-3 iterations.

Here’s a quick demo video. More details on why we built this and how it works below. 👇


Why We Built This

Evals are essential to any successful generative AI product. 

One of the most valuable tools for evaluating LLM applications is model-graded evals, a.k.a. “LLM-as-a-judge.” These evals use another LLM to score the output from the primary LLM in your product. They play a critical role in scaling nuanced, complicated eval tasks that can’t be expressed in code and that would be expensive for humans to do manually.
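As a rough illustration, a model-graded eval is just a second LLM call that scores your product’s output against a rubric. Here’s a minimal sketch, assuming a hypothetical `call_llm` helper as a stand-in for whichever model provider you use (this is not Freeplay’s implementation):

```python
# Minimal LLM-as-a-judge sketch. `call_llm` is a hypothetical stand-in for
# your model provider's client; swap in a real API call.

JUDGE_TEMPLATE = """You are grading the output of another AI system.
Criteria: {criteria}

Question: {question}
Answer to grade: {answer}

Respond with PASS or FAIL on the first line, then a one-sentence explanation."""


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your model provider's API call.")


def judge(question: str, answer: str, criteria: str) -> tuple[str, str]:
    """Return (score, explanation) from the judge model."""
    response = call_llm(JUDGE_TEMPLATE.format(
        criteria=criteria, question=question, answer=answer))
    score, _, explanation = response.partition("\n")
    return score.strip(), explanation.strip()
```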

Customized, product-specific evals are also hard to get right – which, for most teams, means aligning them with the expectations of human subject matter experts. When model-graded evals fail to match those expectations, trust erodes.

They’re also complicated to create. We regularly see people struggle to articulate what they want to measure in the first place, and even once they do, turning it into an eval can be tedious. Not only do they need to tune a combined prompt and model for their application, but they also need a reliable benchmark dataset made up of their own application data to test against. These take effort to produce. Production data and models also evolve over time, which means evals require monitoring and tuning just like other prompts.

At the end of the day, evals function like unit tests in an application — which means you eventually need a lot of them. For the use of model-graded evals to reach their potential across the AI product ecosystem, they need to be much faster to create, tune, and optimize over time. 

Our new Eval Alignment workflow solves this problem. 


What “Good” Looks Like

To make a high-quality model-graded eval, you need the following:

  • A prompt & model that generate a score

  • A benchmark dataset that is large enough to represent real-world customer scenarios, and that includes a balanced distribution of scores for the eval

  • Ground truth labels on that benchmark dataset from reliable human experts, so you know how a human would expect each example in a dataset to be scored

Given that it can be a struggle to even start writing a good eval prompt, the work required to do the rest has been so onerous that many teams postpone it — even when they know they want and need good evals. We’re changing that!


Automating Eval Alignment: Our Approach

We’re building Freeplay for production software teams who need pragmatic solutions to their problems, and whose goal is always to spend more time delivering value for customers (vs. tinkering with AI systems).

These latest Freeplay updates take big steps toward automating the hard parts of creating model-graded evals, and making the manual parts as fast as possible. Perhaps most importantly, they help teams who are new to writing evals get over the hump, start using them, and adopt best practices from the start.

Here’s how the process works:

  1. Create an eval prompt

  2. Automatically generate a benchmark dataset from real data

  3. Test your new eval prompt & create ground truth labels at the same time

  4. Iterate to alignment (with clear versions!)

  5. Grow the dataset & strengthen the eval automatically as you use Freeplay


1. Creating an Eval Prompt

Model-graded evals are just prompts that take data from your application as inputs. They need to be configured and tuned together with the right model — just like normal prompts.

Freeplay makes it easy to get started with common eval templates (examples like Answer Faithfulness for RAG systems, Summary Quality, or PII detection), or to write your own. And since Freeplay knows the structure of your prompts and datasets, it’s easy to write eval prompts that target specific input/output values in them.
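To make that concrete, here’s a hypothetical Answer Faithfulness template for a RAG prompt. The placeholder names are illustrative only, not Freeplay’s actual template syntax; the point is that the eval prompt references specific inputs and outputs from your application prompt.

```python
# Illustrative Answer Faithfulness eval template for a RAG prompt.
# {retrieved_context} and {assistant_answer} are hypothetical names for
# your own prompt's input and output fields.

ANSWER_FAITHFULNESS = """You are checking whether an answer is supported by
the provided context.

Context:
{retrieved_context}

Answer:
{assistant_answer}

Score 1 if every claim in the answer is supported by the context, and 0 if
any claim is unsupported or contradicted. Reply with the score on the first
line, followed by a brief justification."""
```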


2. Generating a Benchmark Dataset

Assessing the quality and trustworthiness of an auto-eval requires a dataset to test against. The trustworthiness of the eval is as much a function of the dataset as it is the eval prompt/model performance. If the dataset is too small or skewed in a particular direction, it can give a false sense of eval performance.

Since Freeplay monitors production logs, we can automatically bootstrap a benchmark dataset for each new eval. We take a random sample, intentionally over-sample, and then select an initial set with a balanced distribution of eval values – i.e. both positive and negative examples. Unlike normal prompt testing, you never want to benchmark an eval on entirely positive, “golden set” answers: if you did, you’d have no way to know whether the eval ever correctly detects negative values.
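The sampling idea itself is simple. Here’s a rough sketch, assuming each logged example already carries a provisional eval value you can stratify on (the field names and counts are illustrative, not how Freeplay works internally):

```python
import random
from collections import defaultdict


def balanced_sample(logged_examples, per_value=15, seed=0):
    """Build a benchmark with a balanced mix of eval values.

    `logged_examples` is assumed to be a list of dicts with an "eval_value"
    key (e.g. "pass" / "fail"); the field name is illustrative.
    """
    random.seed(seed)

    # Group the over-sampled logs by their eval value.
    by_value = defaultdict(list)
    for example in logged_examples:
        by_value[example["eval_value"]].append(example)

    # Keep at most `per_value` examples of each value so the set stays balanced.
    benchmark = []
    for examples in by_value.values():
        random.shuffle(examples)
        benchmark.extend(examples[:per_value])

    random.shuffle(benchmark)
    return benchmark
```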

Once created, people can add/remove examples manually over time, or automatically expand the dataset when they want to scale it.


3. Testing & Labeling Results

The work to capture ground truth labels from human experts on your team can be particularly tedious. Once you define an eval prompt template and Freeplay has generated a benchmark dataset, we preview results in digestible chunks for human review and labeling – 10 examples at a time.

As reviewers examine each example, they can apply the score they think is appropriate for that eval. We show whether or not the model’s score was aligned, along with an explanation of why the model made the choice it did. Reviewers can also update their own score if, after reading the explanation, they end up agreeing with the model.

We then track agreement/disagreement on each example to build up an “alignment score” for each version of your eval prompt. This gives you the confidence to know when the eval is ready to deploy.
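Conceptually, the alignment score is just the agreement rate between the judge model and your human reviewers over the labeled benchmark. A minimal sketch, with illustrative field names (this is not Freeplay’s internal implementation):

```python
def alignment_score(labeled_examples):
    """Fraction of benchmark examples where the judge's score matched the
    human reviewer's label.

    Each example is assumed to be a dict with "model_score" and
    "human_label" keys; both names are illustrative.
    """
    if not labeled_examples:
        return 0.0
    agreed = sum(1 for ex in labeled_examples
                 if ex["model_score"] == ex["human_label"])
    return agreed / len(labeled_examples)
```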


4. Iterating To Alignment

At any time during a review, you can open the eval prompt and edit it. We’ve found it helpful to pop it open and closed quickly, treating it like a notepad where you jot down a change for each new issue you discover.

Once you’ve completed the first 10 examples, we’ll show you an initial alignment score, as well as a distribution of values showing where you agreed or disagreed with the model. That can give further insight into how you might need to tune the eval prompt, and/or supplement with more data. If you need more data to gain confidence, you can ask for the next 10 results to label.

When you’re ready, you can save and test a new version of the eval prompt. It will automatically run a test against the benchmark values you’ve already labeled so you instantly know if the alignment score has improved or not.
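In effect, each saved version gets re-scored against the same labeled benchmark, so its alignment score is directly comparable to the last one. A sketch of that idea, reusing the illustrative `alignment_score` helper from above (the `judge_fns` callables stand in for versions of your eval prompt):

```python
def compare_versions(benchmark, judge_fns):
    """Re-run each eval prompt version over the labeled benchmark and
    report its alignment score.

    `benchmark` is assumed to be a list of dicts with a "human_label" key;
    `judge_fns` maps a version name to a callable that scores one example.
    All names here are illustrative.
    """
    results = {}
    for version, judge_fn in judge_fns.items():
        labeled = [{"model_score": judge_fn(example),
                    "human_label": example["human_label"]}
                   for example in benchmark]
        results[version] = alignment_score(labeled)
    return results
```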


5. Strengthen Your Evals Automatically Over Time

As you use Freeplay to review production data on an ongoing basis, it’s natural to confirm or correct a model-graded score for any production example. When you do, those results automatically grow the ground truth benchmark dataset for eval validation. More ground truth data = more confidence! And if the alignment score starts to drift based on production results, Freeplay can alert you so that you can tune the eval and deploy a new version.

Thanks for reading! If you made it this far, we'd love to talk. Drop us a line at team[at]freeplay.ai, or sign up for access at Freeplay.ai.
