As more software companies mature how they run and iterate on generative AI products, they’re quickly realizing the importance of consistent data review and labeling by human experts. This post goes deeper into why it’s worth the effort, some current best practices, and how Freeplay helps cross-functional teams speed up data review and labeling. If you’re thinking about how to structure better data review workflows for your AI products, this is for you.
———
It’s no secret: Getting LLMs to work well in production applications takes work. For all the chatter on Twitter about the latest research techniques, open source libraries, and other supposed silver bullets to magically generate high-quality LLM outputs, the teams actually building products around generative AI today know it takes a non-trivial amount of iteration and human effort.
Much of that effort comes from people with sufficient expertise looking at row-level data. And not just looking at it, but labeling it, curating it into various datasets, and then either making recommendations to improve evals, prompts, or data pipelines, or else writing those evals and making those adjustments themselves.
Recently, we’ve seen a big uptick in interest from software teams building with generative AI who are seeking a better way to organize human data review and labeling, curate datasets, and write and tune their evals. As they ramp up investment in human review workflows, they’re looking for both best practices and better tools. These types of efforts and workflows are well understood by traditional ML teams, but they’re often new for product development organizations building with generative AI for the first time.
Why human review? Aren’t we past that?
Given all the promise of AI automating away tedious tasks, a lot of development teams are surprised at first by just how valuable it is to have expert humans review lots of row-level data from AI systems. The opposite is true for teams that are well down the path of running AI products in production: they’ll often talk about how their data review and labeling work is the key to unlocking high-quality AI product experiences.
So much so that “look at lots of data” is a bit of a meme in AI engineering circles. From the well-read O’Reilly piece published in May, “What We Learned from a Year of Building with LLMs (Part II)”, one of the authors’ key suggestions was “Look at samples of LLM inputs and outputs every day.” It’s been repeated often since.
We’re seeing industry-wide conviction build around this trend, and it’s turning into operational reality for mature product teams. AI engineering teams are going from ad hoc review processes to well-organized, dedicated efforts. In just the past two weeks, we’ve had:
A large unicorn onboarded a group of data labeling contractors to Freeplay to do regular review of production logs and offline experiment results, working in tandem with PMs and engineers
A Fortune 500 tech company onboarded a group of full-time in-house data analysts to do the same, as well as to help curate datasets and eval benchmarks
A successful growth-stage SaaS company even designated a new full-time role for an “AI Product Quality Lead” to work alongside engineers & PMs. That person is responsible for coordinating data review and labeling, defining evals and dataset strategy, etc. (Here’s a snapshot of their job description)
These companies are making this investment because the constant feedback loop between expert humans and their AI systems is one of the biggest levers to improving product quality. Leaders of these teams also talk about ROI in terms of increased product and team velocity, and in terms of decreasing costs (especially relative to having ML engineers & data scientists do the same work, since the work has to happen either way).
It’s widely understood that RLHF was the big unlock that took GPT-3 to ChatGPT, and the same dynamic applies to other products built on top of LLM APIs. Expert human feedback is a key way to detect nuanced issues in AI products, inform prompt and model changes, suggest new or better evals (to automatically catch issues in the future), and curate helpful datasets that can be used for better testing, fine-tuning, prompt optimization, and more.
Ok, got it. What are humans actually doing?
Whether a team has dedicated data reviewers (analysts, contractors, internal SMEs, etc.) or the work is distributed across existing Engineering, Product, QA, or Support teams, the core work that needs to be done is largely the same.
Look at lots of data.
People need to spend time with row-level prompts and completions. The unpredictability of LLMs, combined with the nuances of many LLM use cases, makes it invaluable to have expert human eyes on data to catch things that would otherwise be missed by current monitoring and eval systems, especially when those systems are nascent. A best practice here is to regularly look into known issues (e.g. negative customer feedback, eval failures) as well as a random sample of production data to catch potential “unknown unknowns.”
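To make that concrete, here’s a minimal Python sketch of one way to build that kind of review sample, assuming your production completions are available as simple dicts with feedback and eval results attached (the field names here are illustrative, not a specific logging schema):

```python
import random

def build_review_queue(completions, sample_size=50, seed=42):
    """Combine known problem cases with a random sample of everything else.

    Assumes each completion is a dict like:
    {"id": "...", "prompt": "...", "output": "...",
     "feedback": "negative" | "positive" | None,
     "eval_results": {"grounded": True, "format_ok": False}}
    """
    # Known issues: negative customer feedback or any failed eval.
    known_issues = [
        c for c in completions
        if c.get("feedback") == "negative"
        or not all(c.get("eval_results", {}).values())
    ]

    # Random sample of the rest, to surface "unknown unknowns".
    known_ids = {c["id"] for c in known_issues}
    remainder = [c for c in completions if c["id"] not in known_ids]
    random.seed(seed)
    random_sample = random.sample(remainder, min(sample_size, len(remainder)))

    return known_issues + random_sample
```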
Label it.
All that looking isn’t worth much if you don’t organize and keep track of what you find. Think through what you need to know to take action as a team, and don’t overdo it. The value of labels is in how they get used. Common practices we see here include categorizing completions into types of failures (and successes), rating completions as good or bad on some scale, and tagging certain completions for escalation to another person or group (e.g. Engineering, Product, etc.). Those labels can then be used to describe themes and take action to improve.
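A lightweight label schema keeps that tracking consistent. Here’s one hypothetical way to structure it; the categories and fields below are examples to adapt, not a prescribed taxonomy:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FailureCategory(Enum):
    HALLUCINATION = "hallucination"
    FORMATTING = "formatting"
    TONE = "tone"
    NO_FAILURE = "no_failure"

@dataclass
class ReviewLabel:
    completion_id: str
    rating: int                        # e.g. 1 (bad) to 5 (good)
    category: FailureCategory = FailureCategory.NO_FAILURE
    escalate_to: Optional[str] = None  # e.g. "engineering", "product"
    notes: str = ""

# Example: a reviewer flags a hallucinated answer for engineering follow-up.
label = ReviewLabel(
    completion_id="cmpl_123",
    rating=2,
    category=FailureCategory.HALLUCINATION,
    escalate_to="engineering",
    notes="Cited a policy document that doesn't exist.",
)
```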
Create or tune evals.
For some teams this means passing suggestions over the transom to engineers or PMs, but Freeplay also makes it possible for the same subject matter experts who do the labeling to write and tune auto-evals. Whoever does it, a best practice after reviewing a batch of data is to write evals that catch the most common issues, so you can automatically detect and work to prevent them in the future.
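As a simple illustration: if reviewers kept flagging completions that echo internal instructions or run far too long, a first pass at automated checks might look like the rule-based sketch below. The failure modes are assumed for the example, and this isn’t any particular eval framework’s format:

```python
def no_prompt_leakage(output: str) -> bool:
    """Pass if the completion doesn't echo internal instruction markers."""
    banned_phrases = ["system prompt", "as an ai language model"]
    return not any(p in output.lower() for p in banned_phrases)

def length_within_limit(output: str, max_words: int = 300) -> bool:
    """Pass if the completion stays under the length reviewers expect."""
    return len(output.split()) <= max_words

def run_evals(output: str) -> dict:
    """Run all automated checks and return pass/fail per eval."""
    return {
        "no_prompt_leakage": no_prompt_leakage(output),
        "length_within_limit": length_within_limit(output),
    }
```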
Curate interesting data into datasets.
Beyond simply applying labels, AI engineering teams need well-curated datasets for other purposes like testing or fine-tuning. It’s important to align as a team up front about which datasets are needed and how they’ll be used; otherwise it’s easy to clutter them with noisy or skewed examples. Whoever adds data to these datasets needs clear criteria for why and when new examples should be added. We frequently see at least the following in practice (a small routing sketch follows this list):
Golden sets of good answers that you can use in testing to identify regressions (or improvements)
Failure datasets that include a specific class of failures, so you can quickly validate you’ve fixed them with a new experiment
Benchmark datasets for your evals that set a baseline for how each eval is supposed to behave
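Here’s the small routing sketch mentioned above: one hypothetical way a reviewed, labeled completion could be sorted into those three dataset types. The field names and thresholds are assumptions to adapt, not a fixed recipe:

```python
def route_to_datasets(completion: dict, label: dict, datasets: dict) -> None:
    """Route a reviewed completion into curated datasets.

    Assumes `completion` has "prompt" and "output"; `label` has a reviewer
    "rating" (1-5), a failure "category" string, and optionally the eval
    values the reviewer confirmed or corrected. `datasets` holds three
    lists: "golden", "failures", and "eval_benchmark".
    """
    example = {"prompt": completion["prompt"], "output": completion["output"]}

    # Golden set: high-rated answers used in testing to catch regressions.
    if label["rating"] >= 4:
        datasets["golden"].append(example)

    # Failure set: a specific class of failure to validate fixes against.
    if label["category"] != "no_failure":
        datasets["failures"].append({**example, "failure": label["category"]})

    # Eval benchmark: human-confirmed eval values defining expected behavior.
    if "confirmed_evals" in label:
        datasets["eval_benchmark"].append(
            {**example, "expected_evals": label["confirmed_evals"]}
        )

datasets = {"golden": [], "failures": [], "eval_benchmark": []}
```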
We talked about some of these workflows more in this prior blog post if you want to go deeper: Building an LLM Eval Suite That Actually Works in Practice
How Freeplay can help
From the very beginning of building Freeplay, our vision has been to give everyone working to build AI products better tools to collaborate with each other – including engineers, product managers, QA analysts and data labeling teams. We’ve found that when these roles work closely together, they get to better outcomes faster. In this case, having everyone look at and label data in the same place where they’re managing datasets and launching experiments allows them to get done in an afternoon what might have taken weeks before Freeplay.
Take a look at some of the key ways Freeplay helps with data review.
Build multi-player data labeling workflows
Teams of analysts can build custom queues using our Live Filters feature, and define their own labeling criteria. This allows them to work as a team to quickly review and label interesting data, for instance “all unreviewed completions in production that got negative feedback or failed evals.”
As they work, they can update review status, leave comments and notes for each other, or escalate for additional review by someone else. For example, a manager could have a separate queue for “review status ‘In progress’ labeled as ‘Needs review by SME’”. Find an issue that needs attention? Drop a link to an engineer or PM so they can easily inspect it, or even open it in the playground to experiment quickly with a fix – no more copy/pasting examples around to different places.
Label data and improve eval quality at the same time
Model-graded eval labels are presented in the same place where other labeling happens. Human reviewers can correct or confirm eval values, and those corrections turn into benchmark datasets for improving auto-eval quality. This streamlines the process so that prompt quality and eval quality don’t have to be worked on separately. We go deeper and share a demo video in this post.
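To illustrate why those benchmark datasets are useful, here’s a generic sketch of scoring an auto-eval against human-corrected labels (simple agreement plus false pass/fail counts). The data shape is assumed, and this reflects general practice rather than Freeplay’s internal scoring:

```python
def score_auto_eval(benchmark: list) -> dict:
    """Compare auto-eval verdicts against human-confirmed labels.

    Assumes each benchmark row looks like:
    {"auto_pass": True, "human_pass": False}
    """
    if not benchmark:
        return {"agreement": 0.0, "false_passes": 0, "false_fails": 0}
    agreement = sum(r["auto_pass"] == r["human_pass"] for r in benchmark) / len(benchmark)
    false_passes = sum(r["auto_pass"] and not r["human_pass"] for r in benchmark)
    false_fails = sum(not r["auto_pass"] and r["human_pass"] for r in benchmark)
    return {
        "agreement": agreement,        # how often the eval matches humans
        "false_passes": false_passes,  # eval said pass, reviewers said fail
        "false_fails": false_fails,    # eval said fail, reviewers said pass
    }
```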
Build datasets in the same place that you use them
Often we see datasets get managed in spreadsheets or other disparate systems that are hard to keep up to date. By having everything in one place, it’s easy to launch new tests or experiments and trust you’ve got the most up-to-date data.
Integrated experimentation
For the same reason it’s helpful to manage datasets in the same place, Freeplay also helps make experimentation easier. Anyone on the team can launch a new prompt or model experiment and run a full eval suite against it – even non-engineers. See something out of place and think you know a fix? Open up your datasets in our playground, make an update, run a test, and drop a link to Product & Engineering with suggestions – all without deploying code or having to use an IDE.
Wrapping up
Success in AI development hinges on creating a tight feedback loop between human expertise and AI systems. As more teams mature from prototype to full production deployments of generative AI, we expect to see more emphasis on human-in-the-loop processes like this. Investing in better human review workflows now is one of the best ways to make quick improvements to your AI product development. Reach out to find out more about how Freeplay could help.