Use Live Filters to Integrate Human & Model-Graded Evals

Jun 11, 2024

A faster, integrated human review and labeling workflow can be the key to unlocking better AI product performance. That’s why Freeplay has made human review workflows a first-class priority from the start. It’s also why PMs, data analysts and QA teams tell us they love using Freeplay alongside their engineering counterparts.

In this post we show how one of our latest features — Live Filters — can be used together with automated evaluations. Together they lead to a faster product optimization loop by giving better insight into production systems and speeding up data labeling workflows.

Live Filters enable targeted monitoring & reviews

The best AI products are built on lots of human-labeled data, and on insights from domain experts. In our conversations with both large and small companies, we’ve found the best product and engineering teams building AI products spend a ton of time reviewing LLM results by hand (even if they have budget for third-party labeling services!).

But it can be incredibly tedious. Too much time is spent today prepping data, shipping it to Excel or third-party labeling platforms, re-interpreting results, and then re-prepping data for use in test datasets or fine-tuning. Plus, less technical team members often need help from data scientists or engineers to find and access the data they need.

Live Filters in Freeplay make all this easier — giving anyone on your team the power to define complex, custom filters on any metadata from your logs to better monitor your LLM systems and surface targeted datasets for human review and labeling.

Once configured, Live Filters can serve as the source for alerts about issues or changes to your system, as well as queues for regular data review by your team. Together with our existing functionality like production monitoring with auto-evals, Freeplay makes it seamless to measure what’s happening in your AI products, spot issues, then label and curate data for further testing, fine-tuning, and refinement. (Plus, a bonus of using Freeplay: we automatically make use of labeled data in your datasets to help optimize your model-graded evals.)

Each Live Filter has its own custom graphs that highlight changes over time to your eval metrics, or common performance metrics like cost and latency. New data will flow in continuously as more Sessions are logged.

So far, we’ve seen our customers create Live Filters for things like:

  • Negative, Positive, or Unreviewed Customer Feedback

  • Hallucinations

  • Bad Retrievals (in a RAG context)

  • Expensive Queries

Building these filters in Freeplay is simple and doesn’t require any code. Anyone on your team can create them and share them with others.

Human review + auto-evals = 🚀

Together with other evaluation types like model-graded evals and custom functions in your code, a consistent human review workflow is key to understanding how your product is really performing, as well as to tuning your system and improving your overall evaluation suite.

Early on in the AI product development journey, it’s critical to get experts’ eyes on lots of data to spot issues. Their insights are often invaluable to define a suite of relevant evals that can be used to measure an AI system as it scales up. And once it’s at scale, a regular workflow to review and label data is essential to identify hidden opportunities for improvement that get missed in metrics, as well as to curate better datasets for testing and fine-tuning.

For example…

One large public consumer technology company recently described to us the review process for their chatbot. Every couple of weeks, their team evaluates a couple of thousand sampled logs from production.

They initially came up with evaluation/review criteria and split up the work across ~20 people internal to their team. It was important to work with their own team, who were familiar with their domain, instead of outsourcing to third-party labeling operations. Two months of work not only produced a high volume of issues to fix after each review, but also surfaced entirely new eval criteria and metrics that would be valuable to track.

In fact, the primary metrics they report on to their exec team today weren’t even in the initial set of evaluation criteria. The new eval criteria emerged from getting a better hands-on sense of what issues were occurring, and where customers were complaining. They’ve since updated their auto-eval suite accordingly, which then further sped up future human reviews by surfacing more interesting data.

Sound familiar? Freeplay’s integrated platform for evals, testing and observability makes those types of ongoing review processes much easier. Even a step like reporting to leadership can be made repeatable and faster.

Build a high-impact review process with Freeplay

As you think about how you might structure a review process like this for your team, let's look at a practical example review workflow that will get results without a ton of overhead and effort:

  1. Define your most important metrics to move, and create Live Filters focused on that data that your team can review regularly.

    • Some evals and guardrails might run only in your code. Log the important ones that you want to track and optimize to Freeplay, or set up model-graded evals in Freeplay directly.

    • Set up Live Filters for the slices of data you're trying to improve. "Negative Customer Feedback" is an obvious one. (You'll also want to look at a random sample of other data to make sure you aren't missing things.)

    • Create a new “Review Status” multi-select criterion for human use only that includes optional statuses such as “Complete,” “Needs PM,” “Needs Eng,” and/or “Needs SME.”

      • Anything marked “Complete” can optionally be hidden from the Live Filters you set up, so the next person doesn't have to review it too.

      • Anything tagged for review by another group of people can go into a new Live Filter just for their use.

    • Add a “Notes” field to capture details on what you learn from each review.

    • Set an expectation for your team to review these queues regularly — e.g. an hour each Friday over coffee.

  2. Give other roles with unique expertise their own Live Filters, like “Product Escalations” or “SME Escalations.”

    • These roles can follow the same process in Step 1.

    • As issues are identified that require changes to prompts, models, or your code pipeline, save relevant data into a Dataset in Freeplay, like "June Failure Cases." Launch a test using that dataset to quickly measure improvement on the issues you found.

  3. At any time, any reviewer can:

    • Add interesting examples to a dataset, e.g. “Failure Cases” to test in future iterations, or a “Golden Set” of great examples.

    • Update or correct evaluation scores to build a feedback loop that helps improve model-graded evaluations.
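
Live Filters themselves are configured in the Freeplay app without code. Purely to illustrate the triage logic in the steps above, here is a minimal Python sketch of how review queues could be derived from logged data. Everything in it is an assumption made for illustration: the `LoggedSession` record, its field names, and the `review_queue` helper are hypothetical and are not Freeplay's schema or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class LoggedSession:
    """Hypothetical record shape for illustration; not Freeplay's schema."""
    session_id: str
    customer_feedback: Optional[str] = None  # "positive", "negative", or None if unreviewed
    review_status: Set[str] = field(default_factory=set)  # human-applied labels
    notes: str = ""


# A filter is any predicate over a logged session.
SessionFilter = Callable[[LoggedSession], bool]


def negative_feedback(session: LoggedSession) -> bool:
    """Matches the 'Negative Customer Feedback' slice."""
    return session.customer_feedback == "negative"


def not_complete(session: LoggedSession) -> bool:
    """Hides anything a reviewer already marked 'Complete'."""
    return "Complete" not in session.review_status


def needs(role: str) -> SessionFilter:
    """Builds a filter for an escalation queue such as 'Needs PM'."""
    label = f"Needs {role}"
    return lambda session: label in session.review_status


def review_queue(sessions: List[LoggedSession], *filters: SessionFilter) -> List[LoggedSession]:
    """Returns the sessions that satisfy every filter, in their original order."""
    return [s for s in sessions if all(f(s) for f in filters)]


sessions = [
    LoggedSession("s1", customer_feedback="negative"),
    LoggedSession("s2", customer_feedback="negative", review_status={"Complete"}),
    LoggedSession("s3", customer_feedback="positive"),
    LoggedSession("s4", customer_feedback="negative", review_status={"Needs PM"}),
]

# Step 1: the main queue shows negative feedback that nobody has finished reviewing.
main_queue = review_queue(sessions, negative_feedback, not_complete)
# Step 2: the PM escalation queue shows only what reviewers tagged for PMs.
pm_queue = review_queue(sessions, needs("PM"))

print([s.session_id for s in main_queue])  # ['s1', 's4']
print([s.session_id for s in pm_queue])    # ['s4']
```

Composing small predicates this way mirrors the workflow: marking a session “Complete” removes it from the main queue, and tagging it for another role places it in that role's queue.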

Start using Live Filters in Freeplay

If you’re up and running on Freeplay already, Live Filters are available to you now. Get started right away in the app.

Or, talk to our team about Freeplay by getting in touch here.


© 228 Labs Inc. 2024