We’ve seen a major uptick in interest in the past 2-3 months among product teams who want to evaluate what LLMs are doing in their product. There’s a growing realization that professional software teams need solid evaluation pipelines to speed up experimentation, monitor production, and optimize prompts for both quality and cost.
All important! And still, we think that’s just the tip of the iceberg in terms of reasons to invest in good evaluations.
At the core, ML systems depend on having high-quality data and a feedback loop to improve what the model does. LLMs are no different. It might seem like magic happens through a REST interface, but if you want to scale production systems that use LLMs, you’re going to need a healthy & up-to-date corpus of well-labeled data to help you address edge cases and optimize outcomes over time.
Your evaluation pipelines can help you get there – creating a data asset you can use for other purposes even as you answer questions about your performance today.
A prior post we wrote here (not just for product managers!) outlines some of our thoughts about the big picture with testing & evaluating LLMs, and how teams can go from zero to automation. This new post gets into details for how you might start to define evaluations that make sense and add value for your specific product or use case.
We walk through the different dimensions for evaluating how well LLMs are performing in your product, including:
What might you use evaluations for?
Why do you need custom evaluations? Why not use general benchmarks?
What are the types of evaluations you might want to use?
What rubrics and scoring methods to use?
How to get started putting them into practice?
We’ll start with some quick background to level-set about what is and is not in scope for this post.
Background
This post is for product development teams who are using LLMs in their product to build one or more features. You might be doing things like:
Building a custom chatbot
Providing a button or other option in a UI that triggers a custom prompt (or chain of prompts)
Creating agentic workflows that respond to a user input and make their way through calling a series of prompts/chains or other APIs until a task is complete
These teams want to know how their custom LLM implementation is performing in their product, for their customers.
The purpose of evaluations — “evals,” as ML practitioners often call them — is to analyze how well LLMs are delivering on a task, a question that often has more than one valid answer and can be subjective in many cases.
In practice, completing an evaluation usually looks like reviewing a collection of data samples and scoring each one using a handful of relevant evaluation criteria, then analyzing the overall results to decide how the system is performing in aggregate.
What might you use evaluations for?
Evaluations provide the backbone for analytics & insights about how well your product is performing, and they can be useful across your software development lifecycle (SDLC).
Things you can do once you’ve set up a good evaluation pipeline:
More quickly experiment and compare different models
Speed up or fully automate your test processes (CI for LLMs!)
Avoid regressions as you iterate, or as model performance evolves
Detect model drift or other production issues
Discover surprising ways your customers use your product
Use the feedback loop from your evaluations to optimize your prompts and LLM request parameters
Curate a foundational labeled data set for fine-tuning and other purposes
Report on performance characteristics for leadership, compliance, or regulatory oversight teams
In short, a solid evaluation suite for your LLM implementation will provide a foundation for building great software with AI. Any one of the examples above can be reason enough to get started.
Why custom LLM evaluations?
If you’re new to this space — like many are — it can be hard to wrap your head around it all, and especially hard to know what’s best to do in your case. There’s a sea of content online that can be confusing, and not always instructive.
Importantly: most of that literature is about evaluating the general characteristics of an LLM.
There’s a big difference between evaluating how well an LLM performs at serving your customers, and evaluating a foundational model like GPT-4 or Claude-2 for things like general knowledge, math skills, human-like responses, or bias. These sorts of general evals are commonly discussed in academic literature, “LLM benchmarks,” and elsewhere online, but we think they’re largely distracting for product teams.
Why? Knowing how well GPT-4 answers math questions likely has no bearing on how well it will, say, summarize a legal document or draft an email as part of your product. Common LLM benchmarks for foundational models might help you pick which model to use, but they won’t likely tell you much about how your product performs.
Similarly, there’s a fair amount of noise about established NLP metrics like ROUGE-N, BLEU, etc. Again, we think these are often distracting for product teams working with LLMs. They’re designed to measure the similarity between two pieces of text, but similarity can be a false signal: two LLM responses might be phrased completely differently yet still meet the same customer need, while two responses phrased almost identically can differ in the one detail that matters most.
When it comes to evals for your product, you need to answer relevant questions about how well you’re delivering on your intended customer experience. That likely means defining your own relevant evaluation criteria, and putting them to use by having humans you trust regularly evaluate data samples that come from your system, as well as setting up automated evaluation systems to help you scale.
The rest of this post is intended to help you think about how you come up with those criteria and start using them in practice.
What are the types of evaluations you might want to use?
We’ll outline several different dimensions you might leverage, then share some examples.
First, there are two main categories of evaluations:
Objective: For example, whether a response meets a particular JSON format, matches a ground truth example, or is greater than/less than a target value. This is the domain of traditional testing and is often possible to do with code (see the sketch just after this list).
Subjective: For example, whether a response is friendly, helpful, or “on brand.” Humans excel here, and LLMs can be great as well (see for example, "Judging LLM-as-a-Judge").
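To make the objective category concrete, here’s a minimal sketch in Python of the kind of code-based checks mentioned above. The function names are illustrative, not from any particular library:

```python
import json

def is_valid_json(response: str) -> int:
    """Objective check: does the response parse as JSON? Returns 1 (pass) or 0 (fail)."""
    try:
        json.loads(response)
        return 1
    except json.JSONDecodeError:
        return 0

def matches_ground_truth(response: str, reference: str) -> int:
    """Objective check: does the response match a known-good answer,
    ignoring case and surrounding whitespace?"""
    return int(response.strip().lower() == reference.strip().lower())

print(is_valid_json('{"summary": "ok"}'))        # 1
print(matches_ground_truth("Paris", " paris "))  # 1
```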
For each of those categories, the evaluation can take one of three forms (borrowing helpful language from the paper linked above):
Single-answer grading: You’re just evaluating one example by itself. No ground truth/reference data needed!
Pairwise comparison: You’re comparing two or more examples (say, from different prompt versions) to decide which is best or to rank them. (e.g. which is better?)
Reference-guided grading: You’re evaluating one or more fresh examples in relation to a reference or ground truth example. That reference example could be good or bad. (e.g. does the new answer match the old one?)
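To make those three shapes concrete, here’s a hedged sketch of how they might show up in an eval harness. The functions, and the trivial checks inside them (a hypothetical “Acme” product-name check, a naive word-overlap check), are purely illustrative:

```python
from typing import Sequence

def grade_single(response: str) -> int:
    """Single-answer grading: score one response on its own, no reference needed."""
    return int("Acme" in response)

def grade_pairwise(responses: Sequence[str]) -> int:
    """Pairwise comparison: return the index of the preferred response
    (here, naively, the longest response that passes the single-answer check)."""
    passing = [(len(r), i) for i, r in enumerate(responses) if grade_single(r)]
    return max(passing)[1] if passing else 0

def grade_with_reference(response: str, reference: str) -> int:
    """Reference-guided grading: score a fresh response against a ground truth
    example (here, a naive word-overlap check)."""
    overlap = set(response.lower().split()) & set(reference.lower().split())
    return int(len(overlap) >= len(reference.split()) // 2)
```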
And in each case, the object of your evaluation might be different. For any evaluation (single-answer or comparative), you might focus on:
Response only: where you can evaluate a response entirely by itself without considering the input (e.g. is the format right? Is this good grammar?)
Request/response: where you want to evaluate how well the response addresses the request (e.g. is the response relevant to the request? Was the response creative?)
RAG (“retrieval-augmented generation”) evaluations: where you want to evaluate the retrieved data (context) in relation to the request (question), the response (answer), or both. (e.g. how relevant is the retrieved context to the original question?)
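For the request/response and RAG cases, a subjective criterion like relevance is often scored with an LLM judge. Here’s a minimal sketch, assuming the OpenAI Python SDK (v1+) and GPT-4 as the judge; the prompt wording and function name are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

JUDGE_PROMPT = """You are evaluating a RAG system.
Question: {question}
Retrieved context: {context}

On a scale of 1-5, how relevant is the retrieved context to the question?
Reply with only the number."""

def context_relevance(question: str, context: str) -> int:
    """Ask an LLM judge to rate retrieved context against the original question."""
    completion = client.chat.completions.create(
        model="gpt-4",  # illustrative; use whichever judge model you trust
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, context=context),
        }],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())
```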
Some Examples
In practice, what do these look like? Assuming each evaluation criterion has a “question” that needs to be answered, here are some examples:
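For illustration (these criteria are hypothetical, and yours should map to your own product goals), a set of criteria spanning the dimensions above might be represented like this:

```python
# Hypothetical evaluation criteria covering the dimensions described above.
EXAMPLE_CRITERIA = [
    {"question": "Is the response valid JSON?",
     "category": "objective", "object": "response only", "scale": "binary"},
    {"question": "Does the response match our brand voice?",
     "category": "subjective", "object": "response only", "scale": "1-5"},
    {"question": "Does the response fully answer the user's request?",
     "category": "subjective", "object": "request/response", "scale": "1-5"},
    {"question": "Is the retrieved context relevant to the original question?",
     "category": "subjective", "object": "RAG", "scale": "1-5"},
]
```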
Scoring & Rubrics for LLM Evaluation Criteria
After establishing your evaluation criteria, you’ll then need to determine the most effective method for scoring examples using these criteria. For the data to be useful, your scoring system should be consistent, as objective as possible (even for “subjective” criteria), and include rubrics that are easy for evaluators to understand and use.
Setting a scale: First, you’ll want to decide on a numerical scale for each evaluation criterion, and it’s likely a choice you’ll want to make on a case-by-case basis to best align with the criteria you’ve defined. Common choices include:
Binary (e.g., 0 or 1, representing "incorrect" or "correct"): Especially useful for objective evaluation criteria, but a binary score can also be helpful to simplify scoring — especially early on, when you might have more tolerance for rough results.
Range (e.g., 1 to 5, where 1 might represent "Strongly Disagree" and 5 might represent "Strongly Agree"): Yields higher-quality and more useful data especially for subjective evaluations.
A note on picking ranges: We recommend starting with a 1-5 scale, because it’s easier to define a rubric around one that you can actually apply consistently. It’s very hard for a human to use a 1-100 scale consistently, or even a 1-10 scale.
Comparative (e.g. better/worse/neutral, or ranked choice for >2 examples): In the case of pairwise comparisons, you might decide which is better based on the aggregate scores for evaluation criteria applied to each, OR you might simply have a human make a choice based on their preference. These can be especially hard to do objectively.
Rubrics: Once you’ve picked a scale to use, you’ll need to develop clear rubrics to guide scoring. A well-defined rubric provides:
Descriptions for each score on the scale. If you’re asking your team to score a result as a 4, what are the guidelines for what makes something a 4 vs. a 3 or 5?
Along with simple descriptions, you might also add clarifying notes that help reduce subjective bias in scoring.
Examples or benchmarks for what constitutes each score — especially helpful to clarify the intent of your descriptions!
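As a hedged sketch of what this might look like in practice (the criterion and wording here are invented, not prescriptive), a 1-5 rubric for a “helpfulness” criterion could be captured as simple data that both your human raters and your eval code can share:

```python
# Hypothetical rubric for a subjective "helpfulness" criterion on a 1-5 scale.
HELPFULNESS_RUBRIC = {
    "criterion": "How helpful is the response to the user's request?",
    "scale": {
        1: "Ignores the request or gives incorrect/irrelevant information.",
        2: "Addresses the request but with major gaps or errors.",
        3: "Mostly addresses the request; minor gaps a user would notice.",
        4: "Fully addresses the request with no meaningful errors.",
        5: "Fully addresses the request and anticipates an obvious follow-up need.",
    },
    "examples": {
        5: "e.g. a summary that covers every key point and flags an open question",
        1: "e.g. a response that answers a different question entirely",
    },
}
```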
The good news: Once you invest in defining these for your human raters, you’ll have made significant progress toward defining them for code-based or LLM evaluations as well. It might seem like you could simply ask an LLM “Grade this result on a scale of 1-10!” and skip the rest, but it’s not that simple — you won’t be happy with the quality you get back.
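For example, the same rubric text can be dropped straight into an LLM-as-judge prompt. A minimal sketch, reusing the hypothetical rubric above:

```python
def rubric_to_judge_prompt(rubric: dict, response_text: str) -> str:
    """Embed the human rubric directly in an LLM-as-judge prompt so the model
    scores against the same definitions your human raters use."""
    scale_lines = "\n".join(
        f"{score}: {description}" for score, description in rubric["scale"].items()
    )
    return (
        f"{rubric['criterion']}\n"
        "Score the response below on a 1-5 scale using this rubric:\n"
        f"{scale_lines}\n\n"
        f"Response:\n{response_text}\n\n"
        "Reply with only the number."
    )
```

You’d send the resulting prompt to whatever judge model you trust, just like the context-relevance sketch earlier.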
How to get started putting your LLM evaluations into practice?
To kickstart the process of defining your own LLM evaluations, we recommend the following:
Start with the goal: What will success look like for your customers? What will make them incredibly happy, and what will make them disappointed or frustrated?
Define evaluation criteria (including scales & rubrics) that relate to those goals: You likely don’t need to study every aspect of your LLM outputs. What are the most important signals that will tell you if you’re delivering a high-quality product experience, and what might tell you when you’re failing?
Label an initial dataset with trusted team members: Don’t outsource yet, and don’t start with an LLM. The investment you’ll make here will pay dividends later. If you don’t have a better dataset, grab 100 recent production examples. Ideally you could curate a mix of known “good” results, edge cases, and failure scenarios so you can test your new criteria on a real mix.
Bonus: Once you’re happy with this dataset, you can put it right to work to create comparative evaluations or tests as you iterate on your prompts.
Adjust based on what you learned: Inevitably, human reviews will teach you where your initial rubrics fail (too vague, not differentiated enough, etc.), where the scale is insufficient, where the criteria themselves could be revised, or where you’re missing a valuable criterion that feels important to add.
Consistently review new datasets and iterate: This isn’t a once-and-done activity, even as you start to automate. Models can drift, customers will use your application in new ways, and you’ll want to make changes over time to optimize. Building a process for humans to regularly review data samples from production (e.g. a batch with positive customer feedback and a batch with negative feedback) will keep you informed of how your system is performing, and it will help you continue to grow a ground truth dataset you can use elsewhere. As you find new edge cases and examples in production, you can also add them to your test cases to improve test coverage. (A sketch of a simple scoring pass over a labeled batch follows this list.)
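To tie the steps above together, here’s a hedged sketch of a scoring pass over a labeled batch, aggregating results per criterion. The data shapes and names are illustrative, not a prescribed pipeline:

```python
from collections import defaultdict
from statistics import mean

# Illustrative shape: each sample pairs a production request/response with
# human (or LLM) scores keyed by criterion name.
samples = [
    {"request": "Summarize this contract...", "response": "...",
     "scores": {"helpfulness": 4, "valid_json": 1}},
    {"request": "Draft a follow-up email...", "response": "...",
     "scores": {"helpfulness": 2, "valid_json": 1}},
]

def aggregate(batch: list[dict]) -> dict[str, float]:
    """Average each criterion across the batch to see aggregate performance."""
    by_criterion = defaultdict(list)
    for sample in batch:
        for criterion, score in sample["scores"].items():
            by_criterion[criterion].append(score)
    return {criterion: mean(scores) for criterion, scores in by_criterion.items()}

print(aggregate(samples))
# e.g. {'helpfulness': 3.0, 'valid_json': 1.0}
```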
How Freeplay can help
We’re building tools for established product teams to run this process end to end and at scale — including defining custom evaluation criteria, giving humans on your team the workflows they need to quickly sample, score & compare data, and simplifying the feedback loop needed to automate testing & evaluation.
Interested to learn more? Get in touch!
Sample product screenshots below
Conduct custom evaluations on production data using Freeplay
Compare two different prompt versions or models in your code using Freeplay