There’s been lots of conversation recently, among our customers and elsewhere online, about evaluating RAG (Retrieval-Augmented Generation) systems. And for good reason! It can be hard to know if your LLM system is doing the right thing with basic prompting, and even more so if you’re adding an extra dimension of complexity with RAG. But it’s also essential: good evals are the key to shipping with confidence.
Over the past couple of months we’ve made it easy to weave RAG evals into Freeplay, right alongside other features like collaborative prompt management, live monitoring of LLMs, and human-led labeling. You can even configure LLMs to run common RAG evals for you automatically.
In this post, we’ll show how it works and how our approach can give you more confidence in your RAG system’s performance as you iterate.
RAG Evals and Why They're Helpful
We previously covered the nuances of evaluating LLM outputs in a product development context, and especially the importance of defining relevant evals for the product experience you’re building. That post briefly touched on RAG evals:
RAG (“retrieval-augmented generation”) evaluations: Where you want to evaluate the retrieved data (context) in relation to either or both the request (question) and the response (answer). (e.g., How relevant is the retrieved context to the original question?)
In other words: Rather than just evaluate how well an LLM response addresses an initial prompt or query, RAG evals additionally help you triangulate between the original user query or input, the retrieved context, and final LLM response.
When it comes to evaluating responses, a few metrics are popular starting points. Each can be scored on a single answer, with no pairwise comparison or ground truth required (a simple scoring sketch follows this list):
Context Relevance: How relevant is the retrieved context to the original query or input? The first step in a RAG pipeline is to provide the right inputs to the LLM to generate a good answer.
Faithfulness: Is the answer directly derived from the provided context, or in other words, is the answer “faithful” to the context? This is a check on whether the LLM is hallucinating, and it’s an important independent variable to assess. Even if the provided context isn’t perfectly relevant, you need to know if the LLM is paying attention to it appropriately.
Answer Relevance: Finally, how relevant is the response to the original query? An LLM response could be faithful to the context, but also confused/distracted by it. You want to additionally evaluate how well the final answer addresses the original query or input.
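To make these concrete, here’s a minimal LLM-as-judge sketch that scores a single question/context/answer triple on all three metrics. This is not Freeplay’s implementation: the judge prompt wording, the 1 to 5 scale, and the model name are illustrative assumptions, and the OpenAI Python client is used only as an example judge.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are grading one response from a RAG system.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Rate each metric from 1 (worst) to 5 (best) and reply with exactly three
lines formatted as `metric_name: score`:
context_relevance: how relevant the context is to the question
faithfulness: how well the answer is supported by the context
answer_relevance: how well the answer addresses the question"""


def judge_rag_example(question: str, context: str, answer: str) -> dict[str, int]:
    """Ask an LLM judge to score one question/context/answer triple."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable judge model works
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                question=question, context=context, answer=answer
            ),
        }],
    )
    scores: dict[str, int] = {}
    for line in response.choices[0].message.content.splitlines():
        if ":" in line:
            metric, value = line.split(":", 1)
            if value.strip().isdigit():
                scores[metric.strip()] = int(value.strip())
    return scores
```

In practice you’d likely split this into one evaluator per criterion, so each judge prompt can be tuned and inspected independently.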
More advanced evals can help you further understand your retrieval system itself (e.g. context precision & recall), or the correctness of answers. Both depend on ground truth data that needs to be collected and curated.
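For illustration, here’s what that ground-truth dependency looks like for context precision and recall, using a simple set-overlap formulation over hand-labeled chunk IDs (one common definition; specific eval libraries may compute these differently):

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant (per ground truth)."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the ground-truth relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in relevant_ids if chunk_id in retrieved_ids)
    return hits / len(relevant_ids)


# Example: 2 of 3 retrieved chunks are relevant, and 2 of 4 relevant chunks were retrieved.
print(context_precision(["a", "b", "x"], {"a", "b", "c", "d"}))  # 0.667
print(context_recall(["a", "b", "x"], {"a", "b", "c", "d"}))     # 0.5
```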
RAG evals are important not just for understanding the quality of the eventual LLM outputs, but also for understanding how choices in your retrieval system (document chunk size, ranking algorithm, etc.) affect those outputs so you can improve them. They can help you identify ways to limit hallucinations, pinpoint opportunities to cut costs by trimming extraneous context, and spot areas of your retrieval system worth adjusting for better results. They also give you a quantitative benchmark of your system’s performance, so you know whether you’re improving over time and can catch quality starting to degrade.
How Freeplay Can Help Automate RAG Evals
Freeplay’s approach to auto-evals gives you a fast way to get started evaluating the metrics above, plus the control you need to make them work for your product context and your prompts. We provide recipes for common RAG evals, the ability to adjust them for your own prompts, and the option to inspect (and correct!) every row-level result.
Here’s how it works:
Start with a prompt template that incorporates context from a RAG system.
Configure the evaluation criteria you want to use for that prompt and enable auto-evaluations. See the screenshot below of setting up auto-evals in Freeplay.
Curate a set of test cases that cover common use & edge cases. You can upload these, or save examples you observe via Freeplay.
Generate a test run with a set of results from the prompt/model combo you want to test. Auto-evaluators run automatically on test runs and produce a score for each evaluation criterion. See the screenshot below of an auto-evaluated test run in Freeplay; a simplified sketch of this loop also follows these steps.
You can easily inspect any row-level results from there, and even correct the score if you disagree with the evaluation.
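Under the hood, step 4 is conceptually a batch loop: generate a response for each curated test case with the prompt/model combo under test, then run each enabled evaluator on the result and summarize the scores per criterion. The sketch below shows that shape in plain Python; it is not the Freeplay SDK, and `answer_with_context` and `evaluate` are hypothetical stand-ins for your own RAG pipeline and evaluators (e.g., the LLM judge sketched earlier).

```python
from statistics import mean
from typing import Callable


def run_test(
    test_cases: list[dict],
    answer_with_context: Callable[[str], tuple[str, str]],
    evaluate: Callable[[str, str, str], dict[str, int]],
) -> dict[str, float]:
    """Score every curated test case and average each metric across the run."""
    per_metric: dict[str, list[int]] = {}
    for case in test_cases:
        # Run the RAG pipeline under test: retrieve context, then generate an answer.
        answer, context = answer_with_context(case["question"])
        # Run the enabled evaluators, e.g. the LLM judge sketched earlier.
        scores = evaluate(case["question"], context, answer)
        for metric, score in scores.items():
            per_metric.setdefault(metric, []).append(score)
    # One summary score per evaluation criterion for the whole test run.
    return {metric: mean(values) for metric, values in per_metric.items()}
```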
Configuring auto-evals in Freeplay
Auto-eval summary after testing a new prompt
With these insights available quickly for any new test, you have the freedom and confidence to keep iterating on your prompts and RAG system. You also have the transparency to inspect exactly what's going on, instead of just receiving a black-box score.
This same auto-eval workflow can easily be repurposed to evaluate subjective criteria like brand voice or tone, to confirm outputs match an expected format, or to mitigate risks like potentially toxic responses from the LLM. Freeplay gives you the ability to define custom evaluation criteria and automatically run them as part of your test process.
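Not every custom criterion needs an LLM judge; format checks in particular can often be plain code. As a small illustration (the function name and pass/fail convention here are hypothetical, not part of Freeplay’s API), a criterion that verifies an output parses as JSON might look like:

```python
import json


def output_is_valid_json(output: str) -> bool:
    """Custom criterion: does the model output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```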
Want to go deeper on RAG evaluations? There are lots of great resources online, but we found this paper especially helpful (h/t for the prompt structure pictured above): Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering
Want to go deeper on how Freeplay can help you evaluate & test your own RAG pipelines? Get in touch.