Prompt Engineering for Product Managers, Part 2: Testing & Evaluation
This post is part 2 in a series to help product managers ramp up on prompt engineering. You can also check out part 1 on Composing Prompts.
There’s a common pattern we’ve seen product teams go through when they start building with LLMs. Early on, it can almost feel like elation: “We had an idea, and OMG this is amazing, it works!” But shortly after the excitement from an initial prototype, uncertainty sets in. Further testing turns up odd edge cases, or simply “meh” results, and it’s unclear how to manage them all. People start experimenting with different prompts & data retrieval methods, and then suddenly realize they need to test all those prior edge cases again. They have to wrangle a bunch of test cases just to confirm they’re on the right track (usually doing so in a spreadsheet). Then, every time their code or prompts change, they need to do it all again.
Getting to a promising prototype with an LLM is easy, but getting to production with confidence is hard. Both for the sake of delivering a great customer experience and limiting downside risk, you want that confidence.
This is where it becomes essential to develop a solid testing & evaluation process — and it’s often not obvious even to experienced software teams what this should look like. But most teams know what they want. They’re used to continuous integration testing where they can make one change and automatically test a system to know if the change has broken anything else. They want the same confidence and freedom to iterate when it comes to working with LLMs.
In this post, we’ll walk through some of the challenges of testing LLMs & tactics to address them, as well as a pragmatic approach product teams can take to build an increasingly mature testing & evaluation process over time.
First, why is this so hard?
Image from the paper Herding AI Cats
Traditional software is comparatively easy to test: Does the code still do what we think it should do in each situation? Even other types of ML can be easier to evaluate: Is the picture a cat or a dog?
When it comes to building products with LLMs, evaluating the outputs is often challenging because determining a “good” result can be very multi-dimensional.
Is the output accurate?
Is the tone right, or “on brand?”
Is the format right so it can be used in our application?
Is the response from this version of our prompt better than the last one?
Does the improvement in this one case cause a degradation in other cases?
Then, say you discover issues and want to improve a response or class of responses. It can also be hard to know how to do so.
There are near-infinite knobs to turn: Different ways to phrase a prompt, different models & providers to choose from, different request parameters that will change the response, different methods to find & retrieve relevant data needed for a dynamic prompt. There’s also the reality that LLMs can be unruly. They occasionally hallucinate, and small tweaks in a prompt can lead to unexpected changes.
In short: it’s much more complicated than testing whether clicking X button still leads to Y action in a database, and fixing an offending line of code if it doesn’t.
The basics: What do you need to test responses from an LLM?
At the simplest level, your “tests” for an LLM-powered feature will be made up of input/output pairs for a specific prompt version or chain of prompts, which you’ll then evaluate using relevant criteria (which we’ll talk about more below). For any prompt or prompt chain that makes up a feature in your software, you’ll eventually want to develop a collection of test cases (inputs at least, and likely input/output pairs) & evaluation criteria you can run quickly for any new iteration of your software. Think of those collections as a “test suite” for your feature.
Test components
Prompt template version: We assume that if you’re building a feature in your product with an LLM, you’re using a prompt template that you then add a specific query or data to each time it runs (the “inputs” below). As you iterate, you’ll want to keep track of the combination of a specific phrasing for a prompt template, as well as the model (and version) you’re sending it to, and the request parameters you use. Each unique combo of these elements makes up a version for your prompt template.
Relevant inputs to your prompt: Here you’ll want to come up with a range of examples that might happen in your product. And not just the happy path examples! You’ll want to test edge cases and potential abuse vectors as well. Each time you create a new prompt template version, you can test it with these inputs.
Outputs/results: You’ll also want to capture what responses you received, so that you can compare different prompts and see the quality of responses in production. These will be the primary focus of your evaluation.
For some evaluation criteria, you’ll also need prior output examples to compare to. If your prompt should have a deterministic answer, you could write these examples yourself. In many cases we’ve seen, though, there’s a wide range of possible “good” answers, so it can make more sense to focus on capturing observed outputs and simply comparing these to new ones.
Running those tests then means pairing each new prompt template version with the same set of inputs and comparing the resulting outputs side by side.
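For illustration, here’s one minimal way to represent those pieces in Python. The shape and field names are just one possible approach (not a required format), but having some structure like this is what makes the later steps easy:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PromptTemplateVersion:
    """One unique combination of template text, model, and request parameters."""
    version_id: str      # e.g. "summarize-ticket-v3"
    template: str        # e.g. "Summarize this support ticket: {ticket_text}"
    model: str           # e.g. "gpt-4"
    params: dict = field(default_factory=dict)  # temperature, max_tokens, etc.

@dataclass
class TestCase:
    """One relevant input for the feature, plus outputs to compare against."""
    inputs: dict                                  # values substituted into the template
    expected_output: Optional[str] = None         # optional known-good answer
    observed_outputs: dict = field(default_factory=dict)  # version_id -> response text
```

A “test suite” for your feature is then just a collection of TestCase records you run against each new PromptTemplateVersion and score with your evaluation criteria.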
Types of evaluation criteria
Then, for any given set of outputs, you’ll want to evaluate them using relevant criteria that matter for your use case or product. These can vary widely. Some can be evaluated on a standalone basis (looking at just one example output on its own), while others require comparing two or more outputs. There are also different ways evaluations can be executed — by code, by an LLM, or by a human.
Below are some examples (the first two are sketched in code after the list):
Can be tested with code: Does the result mention X word or phrase? Does the response match our expected JSON format?
Can be tested by an LLM: Do the facts/substance in this response match a ground truth example? Does the tone or writing style match?
Can only be evaluated by a human: Does this complex response seem right & helpful to an expert? Is this new version of a response better or worse than a prior example?
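To make the first two concrete, here’s a rough sketch in Python of what a code-based check and an LLM-based check might look like. The `call_llm` argument is a placeholder for whichever provider client you use, and none of these function names come from a specific library:

```python
import json
import re

def check_json_format(response: str, required_keys: set) -> bool:
    """Code-based eval: does the response parse as JSON with the keys our app expects?"""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())

def check_mentions_phrase(response: str, phrase: str) -> bool:
    """Code-based eval: does the response mention a required word or phrase?"""
    return re.search(re.escape(phrase), response, flags=re.IGNORECASE) is not None

def check_matches_ground_truth(response: str, ground_truth: str, call_llm) -> bool:
    """LLM-based eval: ask a model to judge whether the substance of a response
    matches a known-good example. `call_llm` takes a prompt string and returns text."""
    judge_prompt = (
        "Compare the RESPONSE to the GROUND TRUTH. Answer only YES if the facts "
        "and substance match, otherwise answer NO.\n\n"
        f"GROUND TRUTH:\n{ground_truth}\n\nRESPONSE:\n{response}"
    )
    return call_llm(judge_prompt).strip().upper().startswith("YES")
```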
A note on evals for research vs. evals for your product
A lot of writing exists online about standard NLP benchmarks or general evaluations for testing foundational LLMs. These are great for research purposes to evaluate the general performance of a model, but we’ve found they’re often irrelevant for testing specific product features. They’re often geared toward things like whether a foundational model answers common questions accurately, picks the best way to complete a thought or sentence, or exhibits bias. Most of the time, you’ll want to know how well your prompt or prompt chain performs for your specific customer use cases, and those benchmarks don’t help.
So what do you do about it? A pragmatic approach
While some teams have done similar testing & evaluation for ML systems in the past, those teams are not the norm today. It can be overwhelming to think about building a comprehensive, automated testing & evaluation process when you first get started with LLMs.
Good news: you don’t have to.
We’ve seen many product teams take a pragmatic approach that allows them to mature their testing process over time — going from eyeballing results at the beginning to an objective, quantitatively driven testing process at the end. When you know the eventual goal is a comprehensive test suite you can quickly run for your feature, you can incrementally build toward it over time.
What could that process look like?
Step 1: Organize & save relevant input/output pairs from the start. Have an example you know you want to use every time? Save it in a structured way that you can easily reference and re-use. Observe an edge case or failure scenario that you know you want to prevent in the future? Save it. You don’t just update your prompt to say “Don’t do X” — you’ll also want an example input that produced X before, so you can make sure it doesn’t happen again. You can always throw out examples later, but early on we see teams wishing they had a structured list of example inputs & outputs they can use to test new versions of their prompts.
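Even something as lightweight as appending each interesting example to a JSONL file gets you most of the benefit. A minimal sketch (the field names here are illustrative, not a required schema):

```python
import json
from datetime import datetime, timezone

def save_example(path: str, inputs: dict, output: str, prompt_version: str, note: str = "") -> None:
    """Append one input/output pair to a JSONL file so it can be re-run against future prompt versions."""
    record = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "prompt_version": prompt_version,
        "inputs": inputs,
        "output": output,
        "note": note,  # e.g. "edge case: user pasted raw HTML" or "abuse attempt"
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```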
Step 2: Run all your example inputs through new prompt or code versions. Just like an integration test, you’ll want to understand how all the important examples perform before you ship an update to customers. Early on, you can do this sort of testing for a given prompt version through an online playground, but you’ll soon want to be able to run the same tests through your code too.
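In code, this can be as simple as looping your saved inputs through the new prompt template version and capturing the responses for evaluation. Here `call_llm` is again a placeholder for however you call your model provider:

```python
def run_test_suite(test_inputs: list, template: str, call_llm) -> list:
    """Run every saved example input through a prompt template version
    and collect the responses so they can be evaluated."""
    results = []
    for inputs in test_inputs:
        prompt = template.format(**inputs)  # render the template with this example's inputs
        response = call_llm(prompt)         # placeholder for your provider/client call
        results.append({"inputs": inputs, "prompt": prompt, "output": response})
    return results
```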
Step 3: Do human evaluations within your team. In the early days, many teams can’t even articulate the eval criteria that matter most to them. Getting hands-on with actual results will help you define what those criteria need to be. And at the end of the day, a simple preference comparison can be enough for most teams to get started, since you’re just deciding whether a new version of a prompt is worth shipping compared to a prior version.
Step 4: Define eval criteria over time and add them to your test process. As you discover different failure cases and learn which evaluation criteria really matter, fold them into your tests. For instance: perhaps it’s relevant to evaluate every time whether an output is on-brand and conforms to a specific JSON format. For each criterion, there should be some objective pass/fail or comparison standard you can use to develop a scoring rubric for a given test. Over time this means you’ll develop an increasingly robust test suite of examples and eval criteria. These criteria can be assessed by a human at first, until you’re ready to automate them.
Step 5: Automate your testing. Once you know what you need to test, the only way to scale up will be with automation. You’ll need a way to automatically run your input test cases through your prompts & code, then run your evaluation criteria automatically (unless you need human reviews, in which case you’ll want a streamlined workflow). It’s likely you’ll end up with a combination of evaluation criteria that can be run by code and/or an LLM, which will let you get close to automated testing. Even if you’re just doing a quick check of these results with human experts, you’ll be much more efficient than testing each example by hand every time.
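Putting the pieces together, an automated run might apply each named criterion to every output and report pass rates, leaving only the human-reviewed criteria for a streamlined manual pass. A rough sketch that reuses the kinds of checks sketched earlier (all names are illustrative):

```python
def evaluate_results(results: list, criteria: dict) -> dict:
    """Apply each named eval criterion (a function from output text to pass/fail)
    to every result and summarize pass rates per criterion."""
    summary = {name: {"passed": 0, "total": 0} for name in criteria}
    for result in results:
        for name, check in criteria.items():
            summary[name]["total"] += 1
            if check(result["output"]):
                summary[name]["passed"] += 1
    return summary

# Example usage with checks like the ones sketched above:
# summary = evaluate_results(results, {
#     "valid_json": lambda out: check_json_format(out, {"title", "body"}),
#     "mentions_product": lambda out: check_mentions_phrase(out, "Acme"),
# })
```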
Step 6: Run a sample of production data through your evaluations. The last step on the maturity curve is running new, real-world examples through the same evaluation criteria to understand what’s actually happening in production. This will help you catch “drift” (when model behavior changes over time), as well as identify gaps or opportunities to improve your prompts.
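The same harness can be pointed at production: log requests and responses, sample a small fraction of them, and score that sample with the same criteria on a schedule. For instance, building on the `evaluate_results` helper sketched above (the sampling rate is arbitrary):

```python
import random

def evaluate_production_sample(logged_results: list, criteria: dict, rate: float = 0.05) -> dict:
    """Score a random sample of logged production outputs with the same eval criteria
    used before release, to spot drift or new failure modes over time."""
    sample = [entry for entry in logged_results if random.random() < rate]
    return evaluate_results(sample, criteria)  # reuses the pass-rate summary helper from above
```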
Making testing & evaluation easier
Freeplay is built to help product teams adopt exactly this process and set of practices. We’ve already made it easy for teams to capture input/output examples whenever code runs, organize that data into test sets, run test sets through your code for a full integration test, and then compare results to prior versions. On deck: making it easier to define custom evaluation criteria. A sneak peek is below; we’d welcome your feedback!
Interested to keep in touch? You can sign up for our newsletter here, or join the product waitlist here.
Coming soon
Test Run dashboard
Comparing two results