TL;DR: Postscript used Freeplay to build the eval suite and experimentation workflow they needed to safely scale an AI-powered e-commerce agent that sends personalized messages with no human in the loop. Using Freeplay's offline evaluations, datasets providing test coverage across thousands of merchants, and prompt iteration tooling, they shipped a high-quality agent product on a hard deadline, confidently validated both test and production outputs, and built trust with their large, established customer base.
What Postscript Is Building
Postscript is building something ambitious: an AI-powered conversational agent called "Shopper" that sends personalized SMS messages directly to consumers on behalf of thousands of e-commerce brands. The goal is 1:1 personalized conversations with customers.
Since messages go out without any human editor in the loop, trust in the agent is critical once it's deployed. The impact of every change needs to be validated across a wide range of customer scenarios, and the challenge is scaling that validation: manually reviewing every new message just isn't possible.
Enter Postscript’s suite of custom evals, created and managed with Freeplay.
"Everyone who thinks about (evals) deeply knows that it’s actually the most important part of your system…When I talk to my team, I say everything’s downstream of a good eval. If you want to have a good prompt, the only way you know how to do that is with a good eval" - Evan Romano, Lead Prompt Engineer
Whether it's a model change or a prompt tweak, the Postscript team needs to know every update is a safe bet before anything ships. And as they work to build trust with their customers, they want a way to quantify that confidence for each one.
Part of their solution has meant involving domain experts from their team in a tight feedback loop with engineering. They needed a system that let everyone on their team — engineers and domain experts — see all the data, test prompt and model changes thoroughly, and define evals together. That’s where Freeplay comes in.
Custom Evals and Tests For Every Production Change
Postscript’s core need was clear: a way to test and evaluate their AI agent offline, at representative scale, before anything reached production. Unlike other AI products that have a human reviewing or editing the output before it’s seen by customers, Postscript’s Shopper sends messages straight to consumers on behalf of merchants. The margin for error is zero. For instance, no one wants a discount code that doesn’t work!
That meant they couldn't guess or vibe-check their way through prompt and model changes. They needed real data, broad test coverage, and a fast feedback loop. Their testing datasets had to reflect the diversity of Postscript's customer base — different brand voices, product catalogs, promotion styles, etc. — and work across thousands of merchants. And, critically for their workflows, the system had to empower the domain experts responsible for prompt engineering and data quality without requiring engineering to test every change.
“(Four months ago) there was a lot of uncertainty about how confident we could be about shipping and then iterating, whether it’s a small change or a big change. (Now) we have something in place that increases our confidence to go to market. We have something like a unit test suite and a good baseline, so that now I’d be pretty surprised if we made a feature change or a model change and were (significantly) off our quality baseline.” - Ian Chan, VP of Engineering
On top of that, the LLM ecosystem moves fast: new models, new versions, shifting APIs, and emerging inference providers appear constantly. The team needed a setup that let them experiment at the same pace without rebuilding their application from the ground up for every minor experiment.
A Platform That Fits The Postscript Workflow And Supports Best Practices
When it came time to select a tool, Postscript looked at several platforms. In their words, many felt too prescriptive or treated evals as a secondary feature. By comparison, Freeplay treats evals as a first-class priority and gave them the flexibility they needed for their non-engineer domain experts to work in lockstep with engineering.
Freeplay gives the entire Postscript team the ability to define, reuse, and run evaluations — both offline and online — at scale, tailored to their exact use case. Postscript has defined dozens of evals to ensure LLM outputs are structured the right way, use the right brand voice, and even stick to each brand's approved set of emojis 😎. Using the right balance of LLM-as-a-judge evals, code evals, and human labeling and preference ranking, Postscript was able to craft an eval suite they could trust to ship a great AI product.
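To make the "code eval" idea concrete, here's a minimal sketch of what one deterministic check like that could look like. It's purely illustrative: the allowed emoji set, length limit, and function name are invented for this example, not taken from Postscript's or Freeplay's actual implementation.

```python
# Hypothetical sketch of a deterministic "code eval" for one agent message.
# The allowlist, length limit, and checks are illustrative, not Postscript's.
import emoji  # third-party package: pip install emoji

APPROVED_EMOJIS = {"😎", "🎉", "💜"}  # invented brand allowlist
MAX_SMS_LENGTH = 320                  # invented limit (~two SMS segments)

def eval_brand_message(message: str) -> dict:
    """Run simple pass/fail structural checks on a single outbound message."""
    used = set(emoji.distinct_emoji_list(message))
    return {
        "only_approved_emojis": used <= APPROVED_EMOJIS,
        "within_length_limit": len(message) <= MAX_SMS_LENGTH,
        "non_empty_body": bool(message.strip()),
    }

if __name__ == "__main__":
    print(eval_brand_message("Flash sale this weekend 🎉 Reply STOP to opt out."))
```

Checks like these sit alongside LLM-as-a-judge evals for the fuzzier questions, like whether a message actually sounds like the brand.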
"Freeplay has become part of how we stay confident in our changes. Whether it’s adopting a new model or tweaking a prompt, we use it to run offline tests before anything hits production. It’s not just about shipping, it’s about making sure we’re not guessing. We don’t want to break trust with merchants, and Freeplay helps us make sure we don’t." - Evan Romano, Lead Prompt Engineer
Just as important, the platform fits the way the team works. Prompt engineers and product managers run evaluations through the UI, while engineers use the SDK to automate and extend tests in code.
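As a rough illustration of the SDK-driven side (deliberately not guessing at the Freeplay SDK's actual API), an automated offline run in code often looks something like the pytest sketch below. The dataset path, its fields, and the generate_reply() function are all placeholders.

```python
# Illustrative sketch of automating offline checks in code (e.g., in CI).
# This intentionally does NOT use the Freeplay SDK's real API; the dataset
# path, its fields, and generate_reply() are placeholders.
import json
from pathlib import Path

import pytest

DATASET = Path("datasets/merchant_scenarios.jsonl")  # placeholder path

def load_cases() -> list[dict]:
    """Load one test case per line: a merchant scenario plus expected constraints."""
    with DATASET.open() as f:
        return [json.loads(line) for line in f]

def generate_reply(case: dict) -> str:
    """Placeholder for calling the agent (prompt + model) under test."""
    raise NotImplementedError

@pytest.mark.parametrize("case", load_cases())
def test_reply_meets_brand_rules(case: dict) -> None:
    reply = generate_reply(case)
    assert len(reply) <= case["max_length"]        # structural check
    assert case["required_disclaimer"] in reply    # e.g., opt-out language
```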
A Major API Migration, Without the Guesswork
One of the most critical turning points came during a major API migration, right ahead of a big launch. Postscript needed to move off the OpenAI Assistants API (where they'd built their initial prototype) and onto the Completions API, which gave them more control.
The change meant reworking how the agent handled conversations, validating outputs across a wide range of customer types, and making sure nothing broke along the way. Without a way to test that migration at scale with real data, it wouldn’t have happened in time for their launch target.
"We were able to make that move from the OpenAI Assistants API to the Completions API because we had Freeplay, right before launch. If we didn’t have evals, we wouldn’t have been able to make that move because it would have just introduced too much risk. And without a suite of evals that we trusted, we would have had to do a shadow rollout or a canary rollout and then just review it as a human. It would have been a nightmare." Evan Romano, Lead Prompt Engineer" - Evan Romano, Lead Prompt Engineer
During this process, Freeplay gave Postscript’s domain experts a direct seat at the table. Evan Romano, who leads prompt design, could run evals, test changes, and improve prompts while freeing time for the engineering team to focus on development. That meant faster iteration, tighter feedback loops, and a shared workflow where product, engineering, and prompt experts could all move in parallel.
Building Trust into Every AI Feature
Now that Postscript's new AI features are live, Freeplay is part of their day-to-day development loop. The team runs offline tests across dozens of evaluations to validate changes before they ship, whether it's a new model, a prompt update, or a config tweak.
Evaluations aren't just a box to check before deployment; they're how Postscript moves fast without sacrificing quality. Freeplay helps the team ship smarter and represent each brand's voice with confidence.
If you're building AI products and features and need a faster way to validate them across real-world use cases, Freeplay can help you move with increased speed and trust. Sign up to get started today here, or book a demo to speak with our team.
Categories
Case Study
Authors

Sam Browning