Development teams building with generative AI quickly realize that the ability to measure quality and iterate with confidence are essential to success – and they’re learning that good evals unlock both. However, most teams building in this space have never had to write or run evals before. They often find themselves hesitating. People tell us they’re unsure of how to begin, or how to know if their evals are “right.”
We’ve integrated AI assistance from Freeplay at several points in the eval creation process to make it easier to get started:
Create Your Own Evals: Start with the basic question you want to answer, and Freeplay’s AI will help you refine the question and write a good prompt for a model-graded eval (aka “LLM as a judge”) that is targeted to your prompts and data
Start With A Template: We’ve also made it easy to begin with common evals (e.g. RAG evals like Answer Faithfulness, other common evals like Similarity, Toxicity, or Tone). We’ve found that even when starting with a template, people need to adapt them to their prompts and data – Freeplay’s AI will again automatically update a model-graded eval prompt to match your prompts and data.
Here’s a quick video that shows what it looks like. Read on for more details and how you can get started.
Why You Need Good Evals
Before we dive into the new features, let's recap for a moment why evals are so critical in AI product development.
Evals are systematic assessments of an AI model's performance and behavior. They serve as a quality control mechanism, helping teams:
Measure the model's effectiveness in real-world scenarios / your own product context
Identify potential biases or errors in output
Ensure consistency across various inputs
Track improvements over time as prompts are refined or models get updated
Without a good, product-specific evaluation suite, teams are essentially flying blind as they iterate on their AI systems. Once you have good evals in place, it’s easy to test and iterate quickly or monitor how your system behaves in production.
What's New With Freeplay Eval Creation
For those who recognize the importance of evals but are looking for some help to get started faster, we built these features to simplify the process. Both features are designed to make it easy for anyone to get started creating good, relevant evals they can trust – even if they’ve never done it before. And even if they don’t write code!
AI-Assisted Custom Eval Creation
Our new “create your own option” for model-graded evals gives AI coaching and feedback on your eval structure, then automatically drafts a strong eval prompt that you can immediately test in Freeplay.
One of the benefits of using the Freeplay platform is that, since we know about your prompt structure and have real-world examples from your logs, we can help you draft eval prompts that are specific to your context.
AI Suggestions on eval phrasing
Eval Template Library
While we often see teams create their own, highly-tailed evals from scratch, it can also be helpful to get some initial ideas or borrow from emerging industry best practices. Our new template library makes it easy to get started with common evals that we know help other teams. And Freeplay’s AI again automatically adapts these templates to your prompt templates and data — including referencing the right input variables — so you can create truly customized evals with one click.
AI-generated prompt from a template called "Input Safety"
Once you create a new eval with either of these paths, our previously announced eval alignment flow makes it quick and easy to test and provide feedback on evals so you can ensure alignment with human experts' judgement.
Why This Matters
Freeplay’s new AI assistance features represent a significant step in our mission to help product development teams build great AI products, and to democratize AI development. These flows are accessible to developers and non-developers alike, and once you create a new eval it’s easy to publish to production without needing to touch any code or do a deploy. We automatically run your evals on both production logs and tests or experiments for you.
Faster Time-to-Market: With Freeplay, you can set up, run, and iterate on evals quickly, helping you ultimately get a better product to market faster.
Increased Confidence: Evals are the only way to experiment and iterate on generative AI features with confidence, or to understand what’s happening in your customer experience. They give you the power to quantify how your AI systems behave.
Cross-Functional Collaboration: Product managers, analysts, and domain experts can contribute directly to the eval process, either alongside of or independent from engineering team processes.
Easy Adoption of Best Practices: Freeplay's AI features help refine your evals into the right form and validate that new evals are behaving as expected before deployment.
Continuous Product Improvement: The ability to write, test, and deploy evals via Freeplay allows for ongoing refinement and optimization of your AI features.
Give Your Team A Better System For AI Development
Freeplay's new AI features for eval creation and alignment work together with Freeplay’s prompt and model management, observability, testing, and data labeling tools to help you ship faster and with confidence. We're empowering teams to work together to create great AI products for their customers.
Bring your prompts and test cases, and use these new tools to quickly explore evals — even if you’ve never used them before. Ready to get started? Get in touch.