Webinar: Evaluating and Optimizing AI Agents
Evaluating and improving AI agents and multi‑agent systems requires a new set of tools and primitives for evals and observability.
Teams need reliable ways to hill‑climb agent performance, validate behavior before release, and catch issues in production across the different parts of an agent (like planning, tool use, self‑reflection, and delegation).
In this webinar we’ll share what works and show how to set up a simple, well-structured set of hierarchical evals and datasets to reliably test and improve each component of an agent, as well as its end‑to‑end behavior. We’ll walk through a concrete multi‑agent scenario to show the anatomy of testing each part.
This session is especially for teams already working on agents who want to use offline and online evals together to ship faster and continuously improve agent quality.
What You’ll Learn:
What’s different about evaluating agents vs. single‑prompt LLM features
How to define and combine prompt‑level and end‑to‑end agent evals
How to measure the quality of different parts of the system in a straightforward way
How to bootstrap an initial agent eval dataset from real examples and define success for your use case
How to combine it all for higher-confidence offline testing any time you make a change, and for online quality monitoring in production
When?
Wednesday, September 24, 10am MT (9am PT, 12pm ET)
You'll Hear From:


Jeremy Silva, Product Lead
Ian Cairns, Co-Founder & CEO of Freeplay
Morgan Cox, Forward Deployed AI Engineer
Sign up to attend or to receive the recording afterward!
AI teams ship faster with Freeplay
"Freeplay transformed what used to feel like black-box ‘vibe-prompting’ into a disciplined, testable workflow for our AI team. Today we ship and iterate on AI features with real confidence about how any change will impact hundreds of thousands of customers."
Ian Chan
VP of Engineering at Postscript
"At Maze, we've learned great customer experiences come through intentional testing & iteration. Freeplay is building the tools companies like ours need to nail the details with AI."
Jonathan Widawski
CEO & Co-founder at Maze
"The time we’re saving right now from using Freeplay is invaluable. It’s the first time in a long time we’ve released an LLM feature a month ahead of time."
Luis Morales
VP of Engineering at Help Scout
"As soon as we integrated Freeplay, our pace of iteration and the efficiency of prompt improvements jumped—easily a 10× change. Now everyone on the team participates, and the out-of-the-box product-market fit for updating prompts, editing them, and switching models has been phenomenal."
Michael Ducker
CEO & Co-founder at Blaide
"Even for an experienced SWE, the world of evals & LLM observability can feel foreign. Freeplay made it easy to bridge the gap. Thorough docs, accessible SDKs & incredible support engineers made it easy to onboard & deploy – and ensure our complex prompts work the way they should."
Justin Reidy
Founder & CEO at Kestrel