LLMs don’t play by traditional software rules. Small shifts in prompts, model versions, or inputs can send outputs in unexpected directions, which makes LLM systems harder to test and trust than traditional software.
Evals are the building blocks that define how you measure quality for your AI system, including whether changes make it better or worse. They’re the foundation for a reliable, scalable feedback loop. Without them, you’re left guessing. With them, your team has credible signals to guide development as your product scales.
But there’s a problem: It’s hard enough to learn to trust whether your LLMs are doing the right thing. How do you know whether your evals are measuring the right things?
This post focuses on a single goal: building individual evals you can trust in production. We cover what gives your team confidence to deploy changes safely, along with a few pitfalls to avoid. Read on as we break down how to design and validate an eval that truly supports production decisions.
The Foundation of a Strong LLM Eval
Even a small change to a prompt or model can ripple across your system in unexpected ways. Traditional testing methods cannot capture the nuances, especially when there is no single “correct” answer for some LLM systems. Without reliable evals, teams rely on intuition, manual review, or delayed feedback from production.
Trustworthy evals are:
Relevant: Connect to quality factors your customers value.
Practical: Help you validate improvements and prevent regressions.
Useful: Support your product and engineering teams in making confident decisions.
Fresh: Get reviewed and adapted as your system evolves.
Importantly, many evals in a product context or for a custom LLM system don’t involve industry-wide benchmarks or common NLP metrics. They function more like unit tests for the specific behaviors that really matter to your product experience.
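To make the “unit test” framing concrete, here’s a minimal sketch in Python. The product names and word limit are hypothetical examples, not a prescribed implementation; the point is that each eval targets one concrete behavior you care about rather than an abstract notion of “quality.”

```python
# A minimal sketch of evals written like unit tests, each targeting one specific
# behavior. The product names and word limit below are hypothetical examples.

def summary_stays_grounded(source_text: str, summary: str) -> bool:
    """Fail if the summary mentions a product name that never appears in the source."""
    product_names = ["Acme Pro", "Acme Lite"]  # hypothetical product names
    return all(name in source_text for name in product_names if name in summary)


def summary_is_concise(summary: str, max_words: int = 120) -> bool:
    """Fail if the summary runs longer than the product spec allows."""
    return len(summary.split()) <= max_words


def run_evals(source_text: str, summary: str) -> dict[str, bool]:
    """Score one example against the small set of behaviors we actually care about."""
    return {
        "grounded": summary_stays_grounded(source_text, summary),
        "concise": summary_is_concise(summary),
    }
```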
“Your first eval set should just be a spreadsheet shared in the team… Step one is having a discussion: Are we optimizing for the right things here?”
- Kelly Schaefer, Product Director, Google Labs from the Deployed Podcast
Starting simple and looking at row-level data together as a team ensures you agree on the definition of quality before you automate anything. It’s a lot easier to scale good judgment than to untangle months of work spent generating misaligned metrics.
A trustworthy eval suite is an operational unlock. Instead of trial-and-error guesswork, AI development can become a structured, repeatable process (an example of what this looks like from Postscript). What becomes possible when you implement a strong eval suite and strategy?
Track improvements and catch regressions. A solid eval suite validates behavior across multiple dimensions and flags when a change improves one area but quietly worsens another.
Iterate faster and with less stress. By comparing multiple prompt or model versions side by side, you can quickly see which performs best across key metrics like completeness, tone, and faithfulness (see the sketch after this list).
Make quantifiable deployment decisions. A strong eval suite provides hard evidence to justify changes, replacing “trust me” with data-backed confidence.
Align your team around a shared quality bar. Shared evaluation criteria ensure that everyone optimizes for the same definition of quality instead of conflicting priorities.
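As a rough illustration of the side-by-side comparison mentioned above, here’s a small sketch that aggregates per-example scores for two prompt versions and flags which metrics improved or regressed. The version names, metric names, and scores are illustrative assumptions; in practice they would come from your automated evals running over the same dataset for each version.

```python
# Compare two prompt versions across a few eval metrics.
# The scores below are illustrative placeholders.
from statistics import mean

# scores[version][metric] is a list of per-example scores between 0 and 1
scores = {
    "prompt_v1": {"completeness": [0.8, 0.7, 0.9], "tone": [0.9, 0.85, 0.8], "faithfulness": [0.7, 0.75, 0.8]},
    "prompt_v2": {"completeness": [0.85, 0.8, 0.9], "tone": [0.9, 0.8, 0.85], "faithfulness": [0.9, 0.85, 0.9]},
}

for metric in ["completeness", "tone", "faithfulness"]:
    v1 = mean(scores["prompt_v1"][metric])
    v2 = mean(scores["prompt_v2"][metric])
    direction = "improved" if v2 > v1 else "regressed" if v2 < v1 else "unchanged"
    print(f"{metric:>13}: v1={v1:.2f}  v2={v2:.2f}  ({direction})")
```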
Nathan Sobo, co-founder of Zed, described why this matters so much for non-deterministic systems:
“I can't stress enough the importance of getting rigorous, systematic automated feedback loops in place... it's never been more true than working with this sort of non-deterministic technology.”
Without a reliable feedback loop, AI development is guesswork. With it, iteration becomes a repeatable process.
3 Common Anti-patterns That Undermine Trust
There are a few common traps we see well-meaning teams fall into. It helps to recognize the red flags when they crop up.
1. Too many evals, not enough focus
While the instinct may be to “measure everything,” a giant matrix of scores can be difficult to interpret and make actionable.
For example, one team we worked with had 25 different evals for their initial eval suite, and just a single prompt. After working together to refine the problems they were trying to solve, they narrowed it down to just 7 that mapped directly to customer value. Decision-making became immediately easier because they weren’t lost trying to decide what to do if 8 were up, 7 were down and the rest neutral.
There’s no magic “right” number of evals. Just make sure the evals you’re tracking are the ones you really care about, and that you’ll take action if they move in the wrong direction.
2. Generic, misaligned evals
Many teams start with evals that don’t matter in practice.
Some are “checkbox” metrics inherited from open source projects or found online, adopted just to have something to measure but never trusted or used for decisions. Others focus on traditional NLP metrics like BLEU or ROUGE, which teams may find ways to improve technically but which often have little connection to customers’ actual perceptions of quality. And when these kinds of evals get treated like business KPIs, teams end up chasing surface-level numbers instead of meaningful product improvements.
If your evals don’t reflect the quality of your customers’ actual experience, they’re not actionable.
Relevant evals avoid these traps. They measure what actually drives user value, inform engineering decisions, and stay focused on improving real outcomes rather than vanity scores.
3. Static evals
Perhaps one of the biggest pitfalls: Treating evals as a “set it and forget it” task. Your eval suite needs to evolve with your product as customer feedback rolls in, new edge cases emerge, and use cases change.
If you’re picking actionable evals that you can improve, and ones that matter to your customer experience, you should expect to meaningfully improve on them over time. And as models increase in intelligence, you can expect a natural lift in quality. But if the things you’re struggling with today aren’t an issue a year from now, does that mean you’ve reached perfection? For many AI products and use cases, the answer is “no.” You’ll just have new things to improve.
This is why we encourage teams to iterate on evals along with the other parts of their system. As you discover new classes of issues, add new evals to catch them. And once you’ve climbed an eval and reached a sustained quality bar you’re happy with, consider whether you still need it. You’ll want to focus your time on the things that still matter to improve.
So, How Can I Start Building a Trustworthy Eval Suite?
Those are the general do’s and don’ts. Here are some practical steps you can take to build a trustworthy eval suite:
Start with real data. Use production logs, user queries, and real-world edge cases as the foundation for your eval datasets, so that they’re not biased toward your internal use cases (or bland synthetic data from an LLM). Learn how to build and curate datasets →
Define concise, specific evaluation criteria. Metrics should reflect your product’s success criteria and catch specific types of failures when they occur. Avoid generalized evals like “quality” that are hard to define (a minimal sketch follows this list). Read our guide on writing effective evals →
Don’t stop looking at your data. Automated scoring is great, but human review data often adds critical nuance and helps you catch the things you haven’t thought to measure yet. See how to mix human labels with automated evals →
Iterate like you would with prompts. Treat evals as living artifacts, refining them as you collect new examples and feedback. The goal is to keep them aligned with human expectations.
Run your evals on datasets that are big enough to matter. A few rows of data are generally not enough to decide whether a change is better. Many of our customers test changes on hundreds or thousands of examples to catch edge cases before they reach production (see the second sketch after this list). Explore Freeplay’s batch testing capabilities →
Visualize and share results. Make it easy for teams to see which specific version of a prompt or agent is better and why. Learn how to compare versions and eval results in Freeplay →
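To ground the “concise, specific criteria” step above, here’s a minimal sketch of a single criterion written as a narrow pass/fail check rather than a vague “quality” score. The refund-policy scenario, function name, and keyword list are hypothetical illustrations, not a prescribed implementation.

```python
# A sketch of one narrow, specific eval criterion: the answer must not promise
# a refund unless the retrieved policy text explicitly allows one.
# The scenario and keyword list are hypothetical.

REFUND_TERMS = ("refund", "money back", "reimburse")


def eval_no_unsupported_refund_promise(answer: str, retrieved_policy: str) -> bool:
    """Return True (pass) if the answer only mentions refunds when the policy does too."""
    answer_mentions_refund = any(term in answer.lower() for term in REFUND_TERMS)
    policy_mentions_refund = any(term in retrieved_policy.lower() for term in REFUND_TERMS)
    return (not answer_mentions_refund) or policy_mentions_refund
```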
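And here’s a rough sketch of the batch testing and human review steps: running that criterion over a larger set of production-derived examples, then checking how often the automated score agrees with human review labels. The file name, record shape, and `human_label` field are assumptions for illustration, not a specific Freeplay API.

```python
import json

# Assumed record shape, e.g. exported from production logs plus a round of human review:
# {"answer": ..., "retrieved_policy": ..., "human_label": "pass" | "fail"}
# Reuses eval_no_unsupported_refund_promise from the sketch above.
with open("eval_dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]

results = []
for ex in examples:
    auto_pass = eval_no_unsupported_refund_promise(ex["answer"], ex["retrieved_policy"])
    results.append((auto_pass, ex["human_label"] == "pass"))

pass_rate = sum(auto for auto, _ in results) / len(results)
agreement = sum(auto == human for auto, human in results) / len(results)
print(f"examples: {len(results)}  auto pass rate: {pass_rate:.1%}  agreement with human labels: {agreement:.1%}")
```

A high pass rate only means something if the automated judgment tracks human judgment, which is why keeping an eye on agreement with reviewed labels is worth the extra column in your dataset.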
More Resources
Want to dig deeper on this topic? Check out our webinar recording on how to build evals you can trust. Or to see how trustworthy evals can be used in a production system, check out our on-demand webinar on the new AI development workflow.
Want to explore how Freeplay can help you build an eval test suite? Reach out for a demo