Building products with large language models (LLMs) is hard to get right. As anyone who's actually shipped to production knows, it's not a "one and done" effort – it takes ongoing experimentation, testing, and iteration to maintain consistently great results.
There’s near-constant opportunity to tweak and improve prompts, try new models, adjust retrieval-augmented generation (RAG) pipelines, and more. This means it needs to be really easy to repeat that iteration loop. And for cross-functional teams, making it easy means being able to work both in code and in a browser.
At Freeplay, we've been working on this problem for over a year now and have gotten to see how hundreds of production software teams approach this experimentation and testing process with LLMs. We've built a platform and a workflow that makes it easier for everyone involved to experiment, test, and deploy with confidence.
There's a video demo at the bottom to see the whole workflow in action. Read on for more details.
What does the iteration process look like today for most teams before they start using Freeplay?
In most cases, it looks fairly ad hoc. Product managers frequently track test cases in a spreadsheet and manually copy/paste them into a web-based playground to test. Engineers store prompt and model config in code and change small details here and there, without a clear versioning strategy or a way to attribute quality scores to a specific configuration. QA and data science teams try to make sense of raw logs to quantify “quality” and spot issues. Anyone we’ve talked to in one of these scenarios is looking for a better way.
That's why we’ve been building a better end-to-end workflow for product development teams to continuously iterate and improve on their LLM-powered features.
Here's what the experimentation & testing process looks like with Freeplay today.
A Better Way to Experiment & Test LLM Changes
1. Quick Experimentation in a Fully Integrated Playground
Freeplay treats each combination of prompt version and model configuration as its own experiment, since each combination produces different results. The process starts in our prompt editor – a playground-like environment that lets you rapidly test different combinations of prompts, models, and parameters with real-world data. No more copy-pasting from spreadsheets into a playground! With Freeplay, you can curate test datasets from uploaded or observed data directly on the platform and quickly see how a range of examples perform.
Screenshot of the Freeplay prompt editor, which lets you load different prompt & model versions and previously saved test cases for rapid iteration.
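To make that concrete, a curated test dataset usually boils down to a set of input variables plus a note about what a good response should include. The snippet below is a hypothetical Python sketch of that shape (the field names and JSONL format are illustrative assumptions, not Freeplay's actual upload schema):

```python
import json

# Illustrative only: a couple of test cases for a hypothetical customer-support
# prompt, each with the input variables the prompt template expects plus a
# note describing what a good answer should contain.
test_cases = [
    {
        "inputs": {"ticket_text": "My invoice shows two charges for May."},
        "expected": "Acknowledges the duplicate charge and explains the refund path",
    },
    {
        "inputs": {"ticket_text": "How do I reset my password on mobile?"},
        "expected": "Points to the mobile reset flow, no billing content",
    },
]

# Save as JSONL so the same cases can be reused across playground sessions,
# batch tests, and integration tests.
with open("support_test_cases.jsonl", "w") as f:
    for case in test_cases:
        f.write(json.dumps(case) + "\n")
```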
2. Test at Scale
Once you have a promising candidate, you can launch a batch test right from the browser. Our auto-evaluation feature scores each test against customizable evaluation criteria to measure things like quality, safety, or correctness — or any other criteria you want to define. This lets you quickly compare new versions to prior ones and decide whether to ship based on quantitative improvements.
Screenshot of a test run with scores across multiple evaluation criteria.
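For a feel of what an evaluation criterion can be, here's a minimal rule-based sketch in plain Python. It isn't Freeplay's implementation (in practice, criteria are often model-graded rather than keyword checks); it just illustrates the idea of scoring a batch of outputs against named criteria and comparing pass rates across versions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    # Takes (test case, model output) and returns pass/fail.
    score_fn: Callable[[dict, str], bool]

# Two toy criteria: a correctness-style keyword check and a simple safety check.
criteria = [
    Criterion("mentions_refund_path", lambda case, out: "refund" in out.lower()),
    Criterion("no_absolute_promises", lambda case, out: "guarantee" not in out.lower()),
]

def score_batch(results: list[tuple[dict, str]]) -> dict[str, float]:
    """Return the pass rate per criterion across a batch of (case, output) pairs."""
    return {
        c.name: sum(c.score_fn(case, out) for case, out in results) / len(results)
        for c in criteria
    }
```

Running the same scoring over two prompt versions gives exactly the kind of quantitative comparison described above: if the new version's pass rates hold or improve, it's a candidate to ship.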
3. Integration Testing with the SDK
Browser-based tests are great for single prompts, but they can't fully test a code pipeline, a chain of prompts, or an agent flow. That's where Freeplay's SDK comes in. Once you've validated a prompt's performance in the browser, you can easily run an integration test to see how your code behaves end-to-end – including validating changes to a RAG pipeline, checking the behavior of a whole chain of prompts, and more.
Python example code pictured. We support SDKs for Python, Node/TypeScript & the JVM – see other examples here.
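The pictured snippet isn't reproduced here, so as a stand-in, here's roughly what an end-to-end integration test can look like in Python with pytest. Note that `answer_ticket` and the dataset path are placeholders for your own pipeline entry point and saved test cases, not part of Freeplay's SDK:

```python
import json
import pytest

# Placeholder import: swap in whatever function runs your full pipeline
# (retrieval, prompt formatting, model call, post-processing).
from my_app.pipeline import answer_ticket

def load_cases(path="support_test_cases.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]

@pytest.mark.parametrize("case", load_cases())
def test_pipeline_end_to_end(case):
    # Exercise the whole code path on a saved test case.
    output = answer_ticket(**case["inputs"])

    # Keep assertions loose here since model output is nondeterministic;
    # detailed scoring happens in the evaluation step, this guards the plumbing.
    assert isinstance(output, str) and output.strip()
```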
4. Deploy, Then Monitor
Freeplay makes it possible for everyone – PMs, QA, domain experts, engineers, etc. – to participate in the experimentation process. Once you’ve run a test, compared the results, and are ready to push a prompt or model change live, you can do so right from the browser, just like a traditional feature flag. Our SDKs make it easy to call the right prompt template & model combo in the right environment. Production results are then logged and evaluated using the same evaluation criteria from testing, so you can know whether your change performs as expected with real production data.
Screenshot of our new comparison view coming soon, including the option to deploy straight from a test results page.
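As a rough sketch of the runtime pattern on the SDK side (not Freeplay's actual SDK surface), the idea is to resolve the deployed prompt and model for the current environment at request time, then log the result so production traffic gets scored with the same criteria used in testing. Everything below is a hypothetical stand-in:

```python
import os

# Hypothetical registry standing in for whatever a prompt-management client
# returns: the prompt version and model are resolved per environment at
# runtime rather than hard-coded, so a browser-side deploy changes behavior
# without a code push.
DEPLOYED = {
    "staging": {"prompt_version": "support-summary@v7", "model": "provider/model-x"},
    "prod": {"prompt_version": "support-summary@v6", "model": "provider/model-y"},
}

def get_active_config(env: str | None = None) -> dict:
    """Look up the prompt template & model combo deployed to this environment."""
    return DEPLOYED[env or os.getenv("APP_ENV", "prod")]

def log_completion(config: dict, inputs: dict, output: str) -> None:
    # In practice this would be sent to your logging/eval pipeline so production
    # results are evaluated against the same criteria used in batch tests.
    print({"config": config, "inputs": inputs, "output": output})
```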
Here’s a quick Loom demo if you want to see this process in action.
At the end of the day, building with LLMs is a team sport. By empowering everyone to experiment and test together at multiple stages in the process, Freeplay helps you ship better products faster. Whether you're a product manager looking to validate ideas or try a new model, an engineer integrating LLMs into production code, or a domain expert providing feedback, Freeplay gives you what you need to get it done quickly and ship with confidence.
Interested in learning more? We’d love to talk. Reach out here.