Use Custom Evals to Monitor & Measure Production LLM Systems

Jan 26, 2024

Freeplay recently launched the ability for our customers to monitor and measure the performance of live LLM pipelines using custom evaluation criteria they select or define. Now it’s easy to know what’s really happening in production, with metrics you select that are directly relevant to your product or feature.

Check out the video at the bottom of this post to see it in action.

Screenshot of the updated Freeplay Sessions dashboard

Why do product development teams need custom metrics for LLM observability?

Product development teams depend on monitoring and analytics to know what’s happening with their systems, but until now it’s been a pain to monitor LLM use in production. Traditional observability tools are built to count what’s easy to count: events, cost, latency, and so on. These can be helpful, but they’re not sufficient for LLMs.

When inputs and outputs are unstructured text or code, you need to evaluate the actual content to know if it’s any good. That requires manual human review, running code, or using another LLM as a judge (see the sketch after the list below).

Then, there are so many dimensions you could evaluate. If you’re generating content:

  • Is the tone & format right?

  • Did the LLM hallucinate?

  • Did it generate something that could be considered negative or biased?

  • If you’re building a chat interface or similar, were your customers asking sincere or off-topic questions?

  • What about specific questions for your use case? For example, in a customer service context, was your customer's question resolved?

  • And, however you or your customers define it, was the answer “good” or “bad”?
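
To make checks like the last two concrete, here’s a minimal LLM-as-judge sketch for a “was the question resolved?” criterion. It assumes the OpenAI Python SDK and an API key in the environment; the rubric, model choice, and labels are illustrative, not something Freeplay prescribes.

```python
# Minimal LLM-as-judge sketch: did the assistant resolve the customer's question?
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY is set.
# The rubric, model choice, and labels are illustrative, not Freeplay built-ins.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a customer-service exchange.

Question: {question}
Answer: {answer}

Was the customer's question resolved? Reply with exactly one word: RESOLVED or UNRESOLVED."""


def judge_resolution(question: str, answer: str) -> str:
    """Ask a judge model to grade a single question/answer pair."""
    response = client.chat.completions.create(
        model="gpt-4",  # any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


print(judge_resolution(
    "How do I reset my password?",
    "Click 'Forgot password' on the login page and follow the emailed link.",
))
```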

In another post, we go deeper into all these dimensions for defining the right eval criteria for your product or feature. Anyone building in this space needs to select and set up eval criteria early in their build process to be able to test and evaluate performance.

A better way: Monitor & measure your LLM system with Freeplay

For a while now, Freeplay has made it easy to select and configure custom evaluation criteria like those in the list above for batch testing scenarios (see our Test Runs feature), and to automatically score those test results. It’s also been easy to filter through live sessions, find the ones you want, and then evaluate them.

But both of these methods are limited by your own selection bias, whether in the test data you chose or the filters you applied. What’s happening in the places you haven’t thought to check yet? We frequently hear teams tell us they test with ~100 test cases and then quickly scale to 100,000s in production, without a good sense of what customers are actually experiencing. LLM observability is still a big challenge.

With Freeplay’s new live monitoring feature, any “single answer” evaluation criterion used in test scenarios can be extended to automatically evaluate a sample of live production data as it comes in. These metrics are displayed alongside more traditional metrics like cost and latency.
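
Conceptually, the sampling step works like the sketch below: a configurable fraction of incoming production sessions gets routed to the scoring step. This is an illustration of the idea only; the function names and the 10% rate are hypothetical, not Freeplay’s implementation.

```python
import random

SAMPLING_RATE = 0.10  # score roughly 10% of live sessions (illustrative value)


def should_auto_eval(sampling_rate: float = SAMPLING_RATE) -> bool:
    """Return True for roughly `sampling_rate` of incoming sessions."""
    return random.random() < sampling_rate


def run_auto_evals(session: dict) -> None:
    """Stand-in for the scoring step, e.g. an LLM-as-judge call like the one above."""
    print(f"Scoring session {session['id']} against the configured criteria...")


def on_session_logged(session: dict) -> None:
    """Hypothetical hook called whenever a production session is recorded."""
    if should_auto_eval():
        run_auto_evals(session)


for i in range(20):
    on_session_logged({"id": i, "input": "...", "output": "..."})
```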

Getting started

Setup is easy. Here’s how it works (a rough sketch of the resulting configuration follows these steps):

  1. Set up evaluation criteria in Freeplay

  2. Enable Auto-Evals

  3. Set a sampling rate
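
To make those three steps concrete, here’s a rough sketch of the information such a configuration captures: a named criterion, a rubric for the judge, the allowed answers, and a sampling rate. The structure and field names are hypothetical, not Freeplay’s schema; in Freeplay this is all configured in the UI.

```python
# Hypothetical representation of an auto-eval configuration. Field names are
# illustrative only; in Freeplay these settings are configured through the UI.
context_relevance_eval = {
    "name": "Context Relevance",
    "rubric": (
        "Given the user's question and the retrieved context, "
        "judge whether the context is relevant to answering the question."
    ),
    "labels": ["relevant", "partially_relevant", "irrelevant"],  # a "single answer" criterion
    "auto_eval_enabled": True,  # step 2: enable Auto-Evals
    "sampling_rate": 0.10,      # step 3: score ~10% of live sessions
}
```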

Here's an example screenshot of a RAG-specific eval that checks whether the provided context was relevant to a question. The video below walks through this process from end to end.

Once configured, Freeplay’s “Sessions” browser makes it easy to dig into the data.

  • See trends and spot issues, from a spike in traditional metrics like p90 cost or latency, to changes in LLM-powered evaluation metrics (e.g. RAG metrics like Answer Relevance, or fully-custom metrics you define like Quality or Brand Voice)

  • Filter across all your Sessions that meet a criterion of interest

  • Easily launch a human review to inspect the details and make changes to your prompts, models, RAG pipelines, or other components as appropriate (a rough sketch of this kind of score-based triage follows below)
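
The triage in that last step boils down to something like the sketch below: pull out the sessions a judge flagged and queue them for a closer look. This is purely illustrative, operating on a hypothetical local export of scored sessions rather than the Freeplay API.

```python
# Illustrative only: filtering a hypothetical export of scored sessions.
# Each record carries both traditional metrics and auto-eval results.
sessions = [
    {"id": "a1", "latency_ms": 820, "cost_usd": 0.004, "answer_relevance": "relevant"},
    {"id": "b2", "latency_ms": 4100, "cost_usd": 0.012, "answer_relevance": "irrelevant"},
    {"id": "c3", "latency_ms": 950, "cost_usd": 0.005, "answer_relevance": "irrelevant"},
]

# Criterion of interest: sessions the judge flagged as having irrelevant answers.
flagged = [s for s in sessions if s["answer_relevance"] == "irrelevant"]

# Queue them for human review (a stand-in for launching a review in the UI).
for s in flagged:
    print(f"Review session {s['id']}: latency={s['latency_ms']}ms, cost=${s['cost_usd']}")
```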

Since Freeplay makes it easy for human reviewers to correct or add new scores as part of any review, we can also help our customers gain trust in their auto-evals and optimize them over time based on human feedback. Check out the video to see more.
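
One concrete way to build that trust is to measure how often the auto-eval agrees with human reviewers on the sessions both have scored. The sketch below is a generic illustration of that idea with made-up labels, not Freeplay’s reporting.

```python
# Compare auto-eval labels with human-reviewed labels for the same sessions.
# All labels and session IDs here are made up for the example.
auto_scores = {"a1": "resolved", "b2": "unresolved", "c3": "resolved", "d4": "resolved"}
human_scores = {"a1": "resolved", "b2": "unresolved", "c3": "unresolved", "d4": "resolved"}

shared = auto_scores.keys() & human_scores.keys()
matches = sum(1 for sid in shared if auto_scores[sid] == human_scores[sid])
agreement = matches / len(shared)

print(f"Auto-eval vs. human agreement: {agreement:.0%} over {len(shared)} sessions")
# A low agreement rate is a signal to tune the judge's rubric or model before relying on it.
```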

As with the rest of Freeplay, we’ve built our auto-evals feature to be transparent and configurable by our customers. We provide recipes for common eval metrics (like RAG relevance, faithfulness/groundedness, etc.), as well as full flexibility to create complex, custom evals specific to your prompt templates and product context. Learn more about configuration options here.

With this launch, it’s now possible to use the same custom evals across the entire product development lifecycle, from early testing and experimentation to production monitoring and optimization. Curious to learn more or want to get started? Get in touch here.
