TL;DR: Postscript used Freeplay to build the eval suite and experimentation workflow they needed to safely scale an AI-powered e-commerce agent that sends personalized messages with no human in the loop. Using Freeplay's offline evaluations, datasets providing test coverage across thousands of merchants, and prompt iteration tooling, they shipped a high-quality agent product on a hard deadline, confidently validated both test and production outputs, and built trust with their large, established customer base.
What Postscript Is Building
Postscript is building something ambitious: an AI-powered conversational agent called "Shopper" that sends personalized SMS messages directly to consumers on behalf of thousands of e-commerce brands. The goal is 1:1 personalized conversations with customers.
Since messages go out without any human editor in the loop, trust in the agent is critical once it's deployed. The impact of every change needs to be validated across a wide range of customer scenarios, and the challenge is scaling that validation: manually reviewing every new message just isn't possible.
Enter Postscript’s suite of custom evals, created and managed with Freeplay.
"Everyone who thinks about (evals) deeply knows that it’s actually the most important part of your system…When I talk to my team, I say everything’s downstream of a good eval. If you want to have a good prompt, the only way you know how to do that is with a good eval" - Evan Romano, Lead Prompt Engineer
Whether it's a model change or a prompt tweak, the Postscript team needs to know every update is a safe bet before anything ships. And as they work to build trust with their customers, they want a way to quantify that confidence for each one.
Part of their solution has meant involving domain experts from their team in a tight feedback loop with engineering. They needed a system that let everyone on their team — engineers and domain experts — see all the data, test prompt and model changes thoroughly, and define evals together. That’s where Freeplay comes in.
Custom Evals and Tests For Every Production Change
Postscript’s core need was clear: a way to test and evaluate their AI agent offline, at representative scale, before anything reached production. Unlike other AI products that have a human reviewing or editing the output before it’s seen by customers, Postscript’s Shopper sends messages straight to consumers on behalf of merchants. The margin for error is zero. For instance, no one wants a discount code that doesn’t work!
That meant they couldn't guess or vibe-check their way through prompt and model changes. They needed real data, broad test coverage, and a fast feedback loop. Their testing datasets had to reflect the diversity of Postscript's customer base — different brand voices, product catalogs, promotion styles, etc. — and work across thousands of merchants. And, critically for their workflows, the system had to empower the domain experts responsible for prompt engineering and data quality without requiring engineering to test every change.
“(Four months ago) there was a lot of uncertainty about how confident we could be about shipping and then iterating, whether it’s a small change or a big change. (Now) we have something in place that increases our confidence to go to market. We have something like a unit test suite and a good baseline, so that now I’d be pretty surprised if we made a feature change or a model change and were (significantly) off our quality baseline.” - Ian Chan, VP of Engineering
On top of that, the LLM ecosystem moves fast: new models, new versions, shifting APIs, and emerging inference providers appear constantly. The team needed a setup that let them experiment at the same pace without rebuilding their application from the ground up for every minor experiment.
A Platform That Fits The Postscript Workflow And Supports Best Practices
When it came time to select a tool, Postscript looked at several platforms. In their words, many felt too prescriptive or treated evals as a secondary feature. By comparison, Freeplay treats evals as a first-class priority and gave them the flexibility they needed for their non-engineer domain experts to work in lockstep with engineering.
Freeplay gives the entire Postscript team the ability to define, reuse, and run evaluations — both offline and online — at scale, tailored to their exact use case. Postscript has defined dozens of evals to ensure LLM outputs are structured the right way, use the right brand voice, and even stick to each brand's approved set of emojis 😎. Using the right balance of LLM-as-a-judge evals, code evals, and human labeling and preference ranking, Postscript was able to craft an eval suite they could trust to ship a great AI product.
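To make the "code eval" idea concrete, here's a minimal sketch of what one deterministic check like that could look like. It's purely illustrative: the allowed emoji set, length limit, and function name are invented for this example, not taken from Postscript's or Freeplay's actual implementation.

```python
# Hypothetical sketch of a deterministic "code eval" for one agent message.
# The allowlist, length limit, and checks are illustrative, not Postscript's.
import emoji  # third-party package: pip install emoji

APPROVED_EMOJIS = {"😎", "🎉", "💜"}  # invented brand allowlist
MAX_SMS_LENGTH = 320                  # invented limit (~two SMS segments)

def eval_brand_message(message: str) -> dict:
    """Run simple pass/fail structural checks on a single outbound message."""
    used = set(emoji.distinct_emoji_list(message))
    return {
        "only_approved_emojis": used <= APPROVED_EMOJIS,
        "within_length_limit": len(message) <= MAX_SMS_LENGTH,
        "non_empty_body": bool(message.strip()),
    }

if __name__ == "__main__":
    print(eval_brand_message("Flash sale this weekend 🎉 Reply STOP to opt out."))
```

Checks like these sit alongside LLM-as-a-judge evals for the fuzzier questions, like whether a message actually sounds like the brand.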
"Freeplay has become part of how we stay confident in our changes. Whether it’s adopting a new model or tweaking a prompt, we use it to run offline tests before anything hits production. It’s not just about shipping, it’s about making sure we’re not guessing. We don’t want to break trust with merchants, and Freeplay helps us make sure we don’t." - Evan Romano, Lead Prompt Engineer
Just as important, the platform fits the way the team works. Prompt engineers and product managers run evaluations through the UI, while engineers use the SDK to automate and extend tests in code.
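As a rough illustration of the SDK-driven side (deliberately not guessing at the Freeplay SDK's actual API), an automated offline run in code often looks something like the pytest sketch below. The dataset path, its fields, and the generate_reply() function are all placeholders.

```python
# Illustrative sketch of automating offline checks in code (e.g., in CI).
# This intentionally does NOT use the Freeplay SDK's real API; the dataset
# path, its fields, and generate_reply() are placeholders.
import json
from pathlib import Path

import pytest

DATASET = Path("datasets/merchant_scenarios.jsonl")  # placeholder path

def load_cases() -> list[dict]:
    """Load one test case per line: a merchant scenario plus expected constraints."""
    with DATASET.open() as f:
        return [json.loads(line) for line in f]

def generate_reply(case: dict) -> str:
    """Placeholder for calling the agent (prompt + model) under test."""
    raise NotImplementedError

@pytest.mark.parametrize("case", load_cases())
def test_reply_meets_brand_rules(case: dict) -> None:
    reply = generate_reply(case)
    assert len(reply) <= case["max_length"]        # structural check
    assert case["required_disclaimer"] in reply    # e.g., opt-out language
```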
A Major API Migration, Without the Guesswork
One of the most critical turning points came during a major API migration, right ahead of a big launch. Postscript needed to move off the OpenAI Assistants API (where they'd built their initial prototype) and onto the Completions API, which gave them more control.
The change meant reworking how the agent handled conversations, validating outputs across a wide range of customer types, and making sure nothing broke along the way. Without a way to test that migration at scale with real data, it wouldn’t have happened in time for their launch target.
"We were able to make that move from the OpenAI Assistants API to the Completions API because we had Freeplay, right before launch. If we didn’t have evals, we wouldn’t have been able to make that move because it would have just introduced too much risk. And without a suite of evals that we trusted, we would have had to do a shadow rollout or a canary rollout and then just review it as a human. It would have been a nightmare." Evan Romano, Lead Prompt Engineer" - Evan Romano, Lead Prompt Engineer
During this process, Freeplay gave Postscript’s domain experts a direct seat at the table. Evan Romano, who leads prompt design, could run evals, test changes, and improve prompts while freeing time for the engineering team to focus on development. That meant faster iteration, tighter feedback loops, and a shared workflow where product, engineering, and prompt experts could all move in parallel.
Building Trust into Every AI Feature
Now that Postscript's new AI features are live, Freeplay is part of their day-to-day development loop. The team runs offline tests across dozens of evaluations to validate changes before they ship, whether it's a new model, a prompt update, or a config tweak.
Evaluations aren't just a box to check before deployment; they're how Postscript moves fast without sacrificing quality. Freeplay helps the team ship smarter and represent each brand's voice with confidence.
If you're building AI products and features and need a faster way to validate them across real-world use cases, Freeplay can help you move with increased speed and trust. Sign up to get started today here, or book a demo to speak with our team.
Categories
Case Study
Authors

Sam Browning