The New AI Development Workflow: Building Blocks for a Better Product Feedback Loop

Learn how leading AI teams run a continuous feedback loop for their generative AI products and agents.

When you’re building products around generative AI models, nearly everything changes about how you build, and how you operate software teams.

During a recent webinar on these topics, our CTO Eric talked about a common experience for engineering leaders: at first glance, it might not seem like much has changed, and lots of teams take on big AI initiatives without revisiting their approach to the development lifecycle. But anyone who spends real time building with LLMs quickly learns that real changes are needed.

On the surface, there are entirely new primitives for anyone who’s been building traditional software — prompts, models, evals, datasets, etc. The workflows and skills are different too: 

  • Design is far more iterative, since you rarely know up front exactly how models will behave, what data or context is needed, etc.

  • Testing is non-deterministic. You can run the same prompts (or even evals!) three times and get different results (see the sketch after this list).

  • Quality has multiple dimensions. Now instead of checking for code errors and HTTP responses, many generative AI use cases require checking for whether results meet human standards for correctness, helpfulness, and even subjective taste.

  • Observability becomes about understanding AI and user behaviors. 

  • Team roles are shifting, with domain experts playing an increasingly central part on many development teams, and taking the lead to define what “good” looks like.
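
To make the non-determinism point concrete, here is a minimal sketch that sends the same prompt three times and prints each response. It assumes the OpenAI Python SDK and an API key in the environment; any model client behaves the same way, and the model name is only an example.

```python
# Minimal sketch: the same prompt, run three times, can produce three different answers.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment; swap in your own client.
from openai import OpenAI

client = OpenAI()
prompt = "Summarize our refund policy in one sentence."

for run in range(3):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; use whatever your product uses
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"Run {run + 1}: {resp.choices[0].message.content}")
```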

It’s no wonder that once teams move from prototype to production, things can quickly feel overwhelming if they haven’t anticipated these changes. What looked easy at first can become difficult to operate at scale. Without a better way to work, progress slows, teams get bogged down in trial and error, and customer experience generally suffers.

So what’s the alternative?

Teams building AI products need to break out of the linear “design, build, test, ship” mindset from a traditional SDLC, and intentionally adopt a continuous feedback and iteration loop with evals at the center.

Below we introduce the workflow we’ve seen work best in practice, and we’ll share a deeper dive on each part in the coming weeks. 

Want to jump ahead? Watch the recording from our last webinar on The New AI Development Workflow.

The Build–Test–Observe Loop, Powered by Evals

The continuous loop between these three stages is how leading AI teams keep shipping quickly while improving quality, even as models, prompts, and user behavior evolve under them.

Build

This is where ideas take shape, and where you’ll come back to every time you decide to iterate on your system.

You might write a new prompt, refine an existing one, test alternative models, wire in new tools, or adjust how your agent orchestration works. The focus at this stage is generally on getting a version working well in the small set of use cases you care about most. Look at things by hand as well as with your tests and evals, but not at scale yet.

It’s about proving you can make the system do what you want well enough that it’s ready to evaluate at scale.
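
As a rough illustration, here is a minimal sketch of that kind of Build-stage spot check: run a prompt change against a handful of hand-picked cases and read the outputs yourself before testing at scale. The prompt, the example cases, and the call_llm helper are all hypothetical placeholders for your own app code.

```python
# Minimal sketch of a Build-stage spot check: a few hand-picked cases, inspected by eye.
SPOT_CHECK_CASES = [
    "How do I cancel my subscription?",
    "I was charged twice this month.",
    "Do you support single sign-on?",
]

PROMPT_V2 = (
    "You are a support assistant. Answer in two sentences and link the relevant help doc.\n\n"
    "User: {message}"
)

def call_llm(prompt: str) -> str:
    # Placeholder: call whichever model or agent you're iterating on.
    return "(model output)"

for message in SPOT_CHECK_CASES:
    print(f"--- {message}")
    print(call_llm(PROMPT_V2.format(message=message)))
```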

Test

Once you've got a working prototype or change, you then want to validate it before it goes to production.

This is when you're ready to run your new build or updates against datasets large enough to reflect the full range of real-world use you’ll see in production. Those datasets ideally include common use cases as well as tricky edge cases and known failure states. A "golden set" of ideal answers is useful where it makes sense (e.g. categorization or fact-finding use cases), but for many generative AI use cases there is no single true "ideal" output.

Your goal at this stage is to validate that what you’ve built works at scale, whether you’re doing that through the Freeplay UI, in a notebook, or in a CI pipeline (or all three!). Evals become especially valuable at this stage to score results at scale, compare to prior versions, and help spot issues that still need attention before you ship. You want as much confidence as possible before you get to production.
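
Here is a minimal sketch of what that Test-stage run can look like in code: score a candidate version against a dataset and compare it to the currently shipped version before promoting it. The dataset, the generate function, and the score_helpfulness eval are hypothetical placeholders; in practice this might run in a notebook, a CI job, or a platform like Freeplay.

```python
# Minimal sketch of a Test-stage batch run: score a candidate vs. the shipped baseline.
from statistics import mean

DATASET = [
    {"input": "How do I reset my password?"},
    {"input": "Why was I charged twice this month?"},
    {"input": "Does the API support webhooks?"},
]

def generate(prompt_version: str, user_input: str) -> str:
    # Placeholder: call your prompt/model/agent for the given version.
    return f"[{prompt_version}] answer to: {user_input}"

def score_helpfulness(user_input: str, output: str) -> float:
    # Placeholder: an eval (heuristic or LLM-as-judge) returning a 0-1 score.
    return 1.0 if output else 0.0

def run_suite(prompt_version: str) -> float:
    return mean(
        score_helpfulness(row["input"], generate(prompt_version, row["input"]))
        for row in DATASET
    )

baseline, candidate = run_suite("v1"), run_suite("v2")
print(f"v1={baseline:.2f}  v2={candidate:.2f}")
assert candidate >= baseline, "Candidate regressed vs. the shipped version; don't promote."
```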

Observe

Once you're live, you want to learn from the real world, and evals are still relevant at this stage. After a change ships, the goal becomes figuring out where it behaves as expected and where it doesn’t.

Generative AI observability should help you capture and surface logs for every LLM interaction and agent run, incorporate any customer feedback or quality signals from your client application, run online evaluations on live traffic to score each instance (or a sample), and surface issues for human review, so that domain experts and engineers can investigate anything unexpected.
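
As a rough sketch of how those pieces fit together, the snippet below logs every interaction and runs an online eval on a sample of traffic, flagging low scores for human review. The logger, the eval, and the sample rate are hypothetical placeholders; a real observability setup would also attach user feedback and trace data.

```python
# Minimal sketch of Observe-stage plumbing: log everything, score a sample, flag low scores.
import random

SAMPLE_RATE = 0.1  # score roughly 10% of live traffic

def log_interaction(record: dict) -> None:
    # Placeholder: send the record to your logging/observability backend.
    print("logged:", record)

def online_eval(record: dict) -> float:
    # Placeholder: score a single production interaction (e.g. with an LLM-as-judge eval).
    return 0.9

def handle_completion(user_input: str, output: str) -> None:
    record = {"input": user_input, "output": output}
    if random.random() < SAMPLE_RATE:
        record["eval_score"] = online_eval(record)
        record["needs_review"] = record["eval_score"] < 0.5
    log_interaction(record)

handle_completion("Why was I charged twice?", "Sorry about that! Let me check your billing history.")
```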

Whether it’s a drop in accuracy, a spike in latency, or a new class of customer requests that you hadn’t planned for, these insights feed directly back into the next Build cycle to inform what to work on and fix next.

Evaluations

Evals ideally serve as the connective tissue across every stage. They play a critical role in helping you specify and scale how you assess quality for your product, turning subjective impressions into measurable signals.

At the Build stage, evals help you know when a prototype is good enough to bother testing at scale. At the Test stage, they give you clear pass/fail criteria and the ability to quantitatively compare a new version to an old one. At the Observe stage, they let you score production data automatically and surface the gaps vs. your test scenarios (and then ideally incorporate those examples into future test sets). Unlike in many traditional ML use cases, LLM evals can often provide meaningful signals without ground truth, which means you can run the same evals (or the subset that don’t require ground truth) both offline and online.
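
For example, here is a minimal sketch of a ground-truth-free eval: an LLM-as-judge rubric that scores any (input, output) pair, so the same function can run offline against test datasets and online against sampled production traffic. It assumes the OpenAI Python SDK; the rubric wording and the judge model are illustrative, not a recommendation.

```python
# Minimal sketch of a ground-truth-free, LLM-as-judge eval usable both offline and online.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Rate the assistant's answer for helpfulness on a scale of 1-5. "
    "Reply with only the number.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def helpfulness_score(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model; pick whatever you trust for grading
        messages=[{"role": "user", "content": RUBRIC.format(question=question, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())

# The same call works on a test-dataset row or a logged production interaction.
print(helpfulness_score("How do I reset my password?", "Go to Settings > Security > Reset password."))
```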

A strong eval suite you can use at each stage is often the difference between trial-and-error guesswork and a disciplined feedback loop your whole team can trust.

What The Loop Unlocks

Running a continuous feedback loop in this way transforms how teams operate. Instead of waiting for customer tickets to reveal problems, they can spot issues earlier, test fixes with confidence, and ship faster. The goal is simple: build AI products that work reliably in the real world, while moving quickly enough to keep up with the pace of change.

By running this loop, teams gain the confidence to hit their goals, because they can:

  • Monitor production for issues before customers raise them

  • Identify and prioritize changes to improve quality

  • Test every change with confidence before shipping to production

  • Move faster by enabling the whole team — engineers, PMs, and domain experts — to contribute

Postscript is one example of a team that found a better way forward using a process like this. By building evals that actually mattered for their product, they gained the ability to confidently test every change before it reached production. That shift gave them the foundation to launch and run a new agent product with confidence (read more about that here).

We also regularly hear similar learnings from guests on our Deployed podcast. If you want to hear how others have tackled related problems, check out our conversations with Kelly Schaefer at Google Labs, Nathan Sobo at Zed, Tyler Phillips at Apollo.io, and Ben Kus, CTO of Box.

What’s Next

The Build–Test–Observe loop is not just a theory. It is the workflow pattern we’ve seen power some of the fastest-moving production AI teams. When you design your team and pipelines to operate in this continuous cycle, you can respond to model changes, user behavior shifts, and new product opportunities without slowing down or losing quality.

In a series of upcoming posts, we’ll go deeper into each stage and share some more practical examples and tips on how to Build, Test, Observe, and Evaluate AI systems. We’ll update this post with links as we add them.
