Building an LLM Eval Suite That Actually Works in Practice

Evals are so hot right now. 😉

A year ago, we heard a lot of teams saying “We know they’ll be important, but not a priority.” Now, we’re seeing a huge influx of interest across the industry.

Many companies have moved from experiments and prototypes to maturing products in production. People are realizing you can’t build and iterate on a generative AI product without good evals.

And, as more teams start to take evals seriously, we’ve discovered that many don’t know where to start, and that some of the concepts can be confusing.

The starting point often feels obvious (“We need relevant evals to measure the performance of our product”), but the details of “how” can be fuzzy. There’s also a lot of noise in the market, with people debating the “best evals.” And there’s always a temptation to adopt simple off-the-shelf metrics that might not actually be that relevant or helpful.

At Freeplay, our mindset matches what we’ve seen work in practice: Start by considering what needs to be improved, decide how to measure it, and then create a few evals for those high-priority metrics. From there, you can build a process your team can use to monitor production, look at lots of data, and iterate on both your core AI system (prompts, RAG, etc.) and your evals as you learn.

In this post, we’ll show how Freeplay can be used to incorporate different types of evals, when and how to use them, and a lightweight process we recommend to iterate on an eval suite for your product.

The end goal is always to get progressively better at understanding performance — and faster at improving your AI products for your customers.

The Goal: An Eval Suite For Continuous Product Improvement

First, it’s important to consider what “evals” really are. Jargon can be confusing!

Fundamentally, we’re talking about individual evaluation criteria that can be run on a given sample of data. For any criterion, you’ll want the ability to calculate an aggregate score for that criterion across the dataset, as well as evaluate improvement/regression for individual test cases. When people reference “evals,” they’re often using the term interchangeably for some combination of these concepts: the criteria, the aggregate metric, and row-level comparisons.

It can be helpful to think about each on its own. The aggregate metric for a whole dataset might tell you how you’re improving generally. In practice, row-level before/after comparisons are often how you’ll spot real issues and gain insight into what to change in your systems. You might care more or less about different criteria at different times too.
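To make the distinction concrete, here is a minimal sketch of an aggregate metric next to a row-level comparison. The case ids, the pass/fail scores, and the function names are all made up for illustration. Note that both runs get the same aggregate score even though individual rows changed, which is exactly what the row-level view is for.

```python
from statistics import mean

# Hypothetical per-row scores (1 = pass, 0 = fail) for a single criterion,
# keyed by test case id, from a baseline run and a candidate run.
baseline = {"case-1": 1, "case-2": 1, "case-3": 0, "case-4": 1}
candidate = {"case-1": 1, "case-2": 0, "case-3": 1, "case-4": 1}


def aggregate(scores: dict[str, int]) -> float:
    """Aggregate metric: mean score across the whole dataset."""
    return mean(scores.values())


def row_level_changes(before: dict[str, int], after: dict[str, int]):
    """Row-level comparison: which test cases regressed or improved."""
    shared = before.keys() & after.keys()
    regressed = sorted(k for k in shared if after[k] < before[k])
    improved = sorted(k for k in shared if after[k] > before[k])
    return regressed, improved


regressed, improved = row_level_changes(baseline, candidate)
print("aggregate:", aggregate(baseline), "->", aggregate(candidate))
print("regressed:", regressed, "improved:", improved)
```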

Individual evals can consider things like:

Quality (a tricky one!)
Relevance
Accuracy
Interestingness (originally used by Google to evaluate LaMDA back in 2021)
Rudeness
Schema adherence or formatting

Those might all be generic, but we frequently see our customers define really niche, specific metrics for their use case. For instance, “Did this auto-generated email address the right recipient?” or “Did this transcript summary include the right timestamp citations?” These sorts of custom evals can be the key to helping you really tune a product.

In any AI product development context, there is likely a range of questions you’ll want to answer, whether to measure if an LLM is behaving as you want in your product or to measure if prompt, model, and/or code changes are improving or harming your customer experience.

The goal for any product development team should be to craft an evaluation suite that gives you a reliable way to:

Measure performance and detect issues in production
Measure test performance or experiments to quantify improvement and spot regressions before going to production
Quickly design an experiment and ask new eval questions to help you further improve your product

In practice, most teams look at more than one criterion at once. So, you’re likely going to develop an eval suite, not just a single criterion/metric. It’s also likely you’ll care about one eval criterion or set of criteria more at a given time, say when you’re experimenting to improve a specific dimension of your product. Yet another reason you’ll want a good eval suite you can deploy as needed.

Different Types of Evals & When to Use Them

The next question is: how best to run these evals? There are three primary modes for executing evals today, and each has its benefits in any eval suite.

Human Review (aka “Labeling”)

For all the conversation about automation and AI, human review is still essential and fundamental to AI systems. If you’re building products in this space, you need to be looking at lots of data!

Human review is also the most expensive and time-intensive mode to run. It’s inherently async. So when and why should you use it?

Confidence: Getting domain experts you trust to look at your data is the only way to really be sure about your system's behavior.
Label ground truth data: This step is essential for a range of reasons, including to be able to tune and refine model-graded evals. Model-graded evals can only be judged as “good” if they align with human expectations, so you’ll need this foundational dataset.
Discover new things: Even if your team is focused on applying specific labels, looking at the data will indirectly surface other issues. This is your best path to discover “unknown unknowns.”

Model-Graded Evals (aka “LLM-as-a-Judge”)

These can feel like the exciting piece. Use AI to evaluate AI! And we’ve found they’re invaluable for accelerating experimentation, testing, and production analysis.

They’re also not always the right tool for the job. They require some work to refine and align with your team’s expectations (as mentioned above), and they also incur an incremental cost and latency hit for each request, even if small.

We’ve found they’re best for:

Answering nuanced questions at scale: They’re way cheaper and faster than humans and do things that code-based evals or string assertions cannot. This makes them invaluable for monitoring production, doing quick experimentation, etc.
Human-like quality for automated testing: There’s simply no other way to run an entire multi-hundred or multi-thousand row dataset through a test suite and get quick, affordable answers each time you need to test.
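As a rough illustration of what a model-graded eval can look like, here is a minimal sketch. The judge prompt, the `model_graded_eval` function, and the `call_model` parameter are all assumptions for this example and not a specific vendor or Freeplay API. `call_model` stands in for whatever LLM client you use, and `fake_judge` is a stub so the sketch runs without an API key.

```python
from typing import Callable, Optional

# Hypothetical judge prompt for one criterion; tune the wording to your own use case.
JUDGE_TEMPLATE = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Criterion: Does the answer directly address the question?
Reply with exactly one word: YES or NO."""


def model_graded_eval(question: str, answer: str,
                      call_model: Callable[[str], str]) -> Optional[bool]:
    """Return True/False for a pass/fail verdict, or None if the reply is unparseable."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    words = call_model(prompt).strip().upper().split()
    first = words[0].strip(".,!:") if words else ""
    if first == "YES":
        return True
    if first == "NO":
        return False
    return None  # leave unparseable verdicts for human review instead of guessing


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    return "yes" if "Paris" in prompt else "no"


print(model_graded_eval("What is the capital of France?", "Paris.", fake_judge))
print(model_graded_eval("What is the capital of France?", "I like trains.", fake_judge))
```

Returning `None` for anything other than a clean YES/NO is a deliberate choice here: it keeps ambiguous judge output out of your metrics until a person has looked at it.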

Code-Driven Evals (aka “Assertions”)

One of the most powerful tools for creating evals is straightforward code. People often talk about these as the “unit tests” for your AI products.

These could range from syntax and grammar checks to regex patterns, to format and style adherence, to numerical calculations like string distance or embedding distance. The best part is, you can run them essentially for free in terms of both cost and latency in most situations.

Use them for:

Any deterministic criteria: Need to do a formatting check, or confirm a string matches a ground truth? It’s always better to use code for this than another LLM!
Run them everywhere, all the time: They can generally be used 100% of the time, any time your code makes an LLM call. And in failure cases, you can quickly retry or trigger fallback logic.
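Here is one way the two ideas above could fit together in code: a deterministic check, plus a retry and fallback when the check fails. The JSON shape, the `[mm:ss]` timestamp format, and the function names are assumptions chosen for illustration, and `generate` stands in for your own LLM call.

```python
import json
import re
from typing import Callable

# Assumed citation format for this sketch, e.g. "[03:15]".
TIMESTAMP = re.compile(r"\[\d{2}:\d{2}\]")


def is_valid_summary(raw: str) -> bool:
    """Deterministic checks: valid JSON object, a string 'summary' key, a timestamp citation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    summary = data.get("summary")
    return isinstance(summary, str) and bool(TIMESTAMP.search(summary))


def generate_with_checks(generate: Callable[[], str], max_attempts: int = 2,
                         fallback: str = '{"summary": "Summary unavailable."}') -> str:
    """Call the generator, retrying on failed checks, then use the fallback."""
    for _ in range(max_attempts):
        output = generate()
        if is_valid_summary(output):
            return output
    return fallback


# Stubbed generator: the first attempt is malformed, the second passes.
attempts = iter(["not json", '{"summary": "Pricing was discussed [03:15]."}'])
result = generate_with_checks(lambda: next(attempts))
print(result)
```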

Freeplay’s Approach: Build & Run an Eval Suite as a Team

All that might sound great, but teams quickly begin asking other questions like:

How many evals do we need?
Which ones are the right ones?
When do we run them in code, when do we use models, and how much do people get involved?
How do we know if we can trust them?
What are the “best” evals to use?

The good news: Developing a helpful eval suite is an iterative process that really results from looking at data and learning along the way.

There’s no pressure to get it perfect out of the gate! The more important priority is building a process and tooling you can use to quickly adjust and adopt new evals as you spot issues. With a focused, consistent process, you can get to great, reliable results.

We’ve seen this process form within a range of teams now, from large to small, and AI expert to novice. From observing what’s worked for them — and seeing others in the industry learning the same things! — we think the following process can serve as a solid guide to get started. We’re building Freeplay to make this whole process seamless and easy.

Capture All Your Logs

LLM logs are more than just client and server events. You’ll especially want to make sure you have a good way to look at lots of long strings.

Version your prompt templates, model, and model config (like request parameter settings) so you know exactly what generated the results, and log all completions with this metadata. Freeplay makes versioning easy.
Log customer feedback and behavioral signals, like client events for regeneration attempts or corrected strings. Freeplay makes logging easy too.
Log any relevant automated evals from code or LLMs. Freeplay lets you configure and run model-graded evals entirely on our platform, but you can also record any eval values from your code.
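If you are wiring this up yourself, a log record covering the three points above might look something like the sketch below. This is not the Freeplay SDK or schema. The `CompletionLog` class, its field names, and the example values are all hypothetical, and the record is simply serialized as one JSON line.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class CompletionLog:
    """Hypothetical log record: what generated the output, plus signals about it."""
    prompt_template_version: str
    model: str
    model_config: dict
    inputs: dict
    completion: str
    eval_results: dict = field(default_factory=dict)       # code-based or model-graded evals
    customer_feedback: dict = field(default_factory=dict)  # explicit or behavioral signals
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


record = CompletionLog(
    prompt_template_version="summarize-v7",  # made-up version label
    model="example-model",                   # placeholder model name
    model_config={"temperature": 0.2, "max_tokens": 512},
    inputs={"transcript_id": "t-123"},
    completion="Pricing was discussed [03:15].",
)
record.eval_results["has_timestamp_citation"] = True
record.customer_feedback["regenerated"] = False

line = json.dumps(asdict(record))  # one JSON line per completion
print(line)
```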

Surface Logs for Human Review

Dedicate time each week to getting hands-on with real-world data. The best teams spend hours a week doing this themselves and don’t try to outsource it all.

Use eval metrics from logs to spot outliers and inspect them.
Check out positive and negative customer feedback consistently.
Always look at a random sample as well to catch things that might be lost in biased metrics or skewed data.
As you review, follow a few best practices (Freeplay, again, makes all this easy):
- Apply tags or leave notes to capture what you observe
- Inspect the eval scores and correct them if needed — if a model-graded eval got something wrong, you’ll want the right label value as a ground truth to improve
- Save interesting examples to datasets for future testing purposes — e.g. a golden set, failure cases, abuse attempts.
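The selection steps above (outliers, customer feedback, and a random sample) can be sketched in a few lines. The log shape, the `relevance` score key, and the 0.5 threshold are assumptions for this example, so adapt them to however your own logs are stored.

```python
import random


def build_review_queue(logs, score_key="relevance", low_threshold=0.5,
                       random_n=2, seed=0):
    """Pick logs for human review: low-scoring outliers, any log with
    customer feedback, and a random sample of everything else."""
    flagged = [
        log for log in logs
        if log["evals"].get(score_key, 1.0) < low_threshold
        or log.get("feedback") is not None
    ]
    flagged_ids = {log["id"] for log in flagged}
    remaining = [log for log in logs if log["id"] not in flagged_ids]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    sampled = rng.sample(remaining, min(random_n, len(remaining)))
    return flagged, sampled


# Hypothetical logs; in practice these come from your logging store.
logs = [
    {"id": 1, "evals": {"relevance": 0.9}, "feedback": None},
    {"id": 2, "evals": {"relevance": 0.2}, "feedback": None},
    {"id": 3, "evals": {"relevance": 0.8}, "feedback": "negative"},
    {"id": 4, "evals": {"relevance": 0.95}, "feedback": "positive"},
    {"id": 5, "evals": {"relevance": 0.7}, "feedback": None},
    {"id": 6, "evals": {"relevance": 0.85}, "feedback": None},
]
flagged, sampled = build_review_queue(logs)
print([log["id"] for log in flagged], [log["id"] for log in sampled])
```

Sampling only from the logs that were not already flagged keeps the random slice independent of the metrics, which is the point of looking at it.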

Tune your evals or write new ones

For model-graded evals, you’ll want to regularly do some human labeling using the same criteria and confirm your evals remain aligned with your team’s expectations. No matter how great your prompts or models are, this will be worthwhile. And for your other human labels or code-based evals, you’ll discover new ones to add or old ones to deprecate.

For each new issue you discover, consider whether you could update your evals or create new ones that will help you catch that issue automatically the next time. It’s especially easy to add more code-based evals.
As part of configuring model-graded evals in Freeplay, you can quickly review some sample data and confirm alignment to your human labels.
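If you want to check alignment by hand, a simple starting point is to compare the model-graded verdicts against human labels on the same rows. The sketch below assumes pass/fail labels and uses made-up data. The `alignment_report` function is illustrative and not part of any library.

```python
def alignment_report(human: list[bool], judge: list[bool]) -> dict:
    """Compare model-graded verdicts to human labels on the same rows."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    agree = sum(h == j for h, j in pairs)
    return {
        "agreement": agree / len(pairs),
        # Judge passed something a human failed: issues slip through.
        "false_pass": sum(j and not h for h, j in pairs),
        # Judge failed something a human passed: noisy alerts.
        "false_fail": sum(h and not j for h, j in pairs),
    }


# Hypothetical labels for five rows (True = pass).
human_labels = [True, True, False, False, True]
judge_labels = [True, False, False, True, True]
report = alignment_report(human_labels, judge_labels)
print(report)
```

Splitting disagreements into false passes and false fails is useful because they usually call for different fixes to the judge prompt.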

Experiment with fixes

It’s easy for LLM evals to explode in complexity. One practice we’ve seen work well: As you discover and prioritize issues, define focused experiments to improve one specific issue at a time.

Pick an objective to optimize for your experiment & the evals that matter for it.
Test against relevant datasets
Measure & inspect results.

Ship your updates and monitor prod

It’s especially important to keep watching what happens in production to spot any skew in results vs. your test cases/environment. Customers will surprise you!

Freeplay makes this easy with Live Filters.
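For a sense of what spotting skew can mean in practice, here is a minimal sketch that compares per-eval pass rates from a test run against production. The eval names, the rates, and the 0.05 tolerance are invented for illustration, and this is unrelated to how Live Filters is implemented.

```python
def find_skew(test_rates: dict[str, float], prod_rates: dict[str, float],
              tolerance: float = 0.05) -> dict[str, float]:
    """Return evals whose production pass rate differs from the test
    pass rate by more than the tolerance, with the signed gap."""
    skewed = {}
    for name in test_rates.keys() & prod_rates.keys():
        gap = prod_rates[name] - test_rates[name]
        if abs(gap) > tolerance:
            skewed[name] = round(gap, 3)
    return skewed


# Hypothetical pass rates for two evals.
test_rates = {"relevance": 0.92, "format_valid": 0.99}
prod_rates = {"relevance": 0.81, "format_valid": 0.98}
skew = find_skew(test_rates, prod_rates)
print(skew)
```

A gap like the one on `relevance` here would be a cue to pull production examples into your test datasets so the two stay representative of each other.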

————

If you’ve been spending time here, we’d love to hear your feedback. What else have you found helpful, and what did we miss? Or if it sounds helpful and you want to get started, we’d love to help there too. Either way, please reach out.
