Webinar: Evaluating and Optimizing AI Agents

Evaluating and improving AI agents and multi‑agent systems requires a new set of tools and primitives for evals and observability.

Teams need reliable ways to hill‑climb agent performance, validate behavior before release, and catch issues in production across the different parts of an agent (like planning, tool use, self‑reflection, and delegation).

In this webinar, we’ll share what works and show how to set up a simple, well-structured set of hierarchical evals and datasets to reliably test and improve each component of an agent, as well as its end-to-end behavior. We’ll walk through a concrete multi-agent scenario to show the anatomy of testing each part.

This webinar is especially for teams already working on agents who want to use offline and online evals together to ship faster and continuously improve agent quality.

What You’ll Learn:

What’s different about evaluating agents vs. single‑prompt LLM features
How to define and combine prompt‑level and end‑to‑end agent evals
How to measure the quality of different parts of the system in a straightforward way
How to bootstrap an initial agent eval dataset from real examples and define success for your use case
How to combine it all for higher-confidence offline testing any time you make a change, and for monitoring quality online in production

When?

Wednesday, September 24, 10am MT (9am PT, 12pm ET)

You'll Hear From:

Jeremy Silva, Product Lead

Ian Cairns, Co-Founder & CEO of Freeplay

Morgan Cox, Forward Deployed AI Engineer

Sign up to attend, or to receive the recording afterward!

AI teams ship faster with Freeplay

"Freeplay transformed what used to feel like black-box ‘vibe-prompting’ into a disciplined, testable workflow for our AI team. Today we ship and iterate on AI features with real confidence about how any change will impact hundreds of thousands of customers."

Ian Chan

VP of Engineering at Postscript

"At Maze, we've learned great customer experiences come through intentional testing & iteration. Freeplay is building the tools companies like ours need to nail the details with AI."

Jonathan Widawski

CEO & Co-founder at Maze

"The time we’re saving right now from using Freeplay is invaluable. It’s the first time in a long time we’ve released an LLM feature a month ahead of time."

Luis Morales

VP of Engineering at Help Scout

"As soon as we integrated Freeplay, our pace of iteration and the efficiency of prompt improvements jumped—easily a 10× change. Now everyone on the team participates, and the out-of-the-box product-market fit for updating prompts, editing them, and switching models has been phenomenal."

Michael Ducker

CEO & Co-founder at Blaide

"Even for an experienced SWE, the world of evals & LLM observability can feel foreign. Freeplay made it easy to bridge the gap. Thorough docs, accessible SDKs & incredible support engineers made it easy to onboard & deploy – and ensure our complex prompts work the way they should."

Justin Reidy

Founder & CEO at Kestrel
