Simplify Chatbot Testing and Evaluation with Freeplay

Jul 17, 2024

Chatbots are one of the most common user experiences for generative AI products, and they’re often where product teams start. They’re also surprisingly hard to get right. Testing edge cases and iterating on a chatbot without introducing regressions can be particularly challenging.

To address these challenges, we’ve launched a series of new features at Freeplay that speed up the process of building, testing, evaluating, and monitoring chatbots. This post walks through what’s new, including a quick demo video below. Our goal is to help product development teams build robust AI systems and improve them over time, and these updates are a significant step toward making that easier for teams building chatbots.

What’s new

  • Chat view for easy data review and trace exploration

  • Format prompts to include conversation history

  • Save (and modify) conversation history in datasets for easy testing

  • Run batch tests and automate evaluations on real-world chat scenarios

Here’s a quick video to see how they work together. More details below!

Why Chatbots Are Challenging

Chatbots introduce a few unique challenges:

  1. Open-ended flow: Customers can ask anything they want, shifting between on-topic and off-topic questions or dialog in an instant.

  2. Statefulness: Maintaining context over the course of a conversation adds complexity.

  3. Conversational design: People expect interactions with a chatbot to feel natural and human-like, not static or stilted.

For those reasons and more, it’s important to consistently test for a wide range of scenarios, monitor real-world interactions with your customers, and respond to customer feedback without the fear of introducing regressions.

Reviewing chat conversations in Freeplay

At the core of building any great LLM product is having a deep understanding of the data going into and coming out of your system. The best teams look at lots of row-level data to catch nuanced issues, then take action to improve their prompts, try new models, improve their RAG pipelines, extend their eval suite, etc.

Looking at chat conversations in most LLM observability products is a pain, and until now Freeplay was no exception. A given chat turn might consist of a handful of LLM requests and tool calls, making it hard to piece together what the customer saw by looking through each step in a trace.

Our new “Chat” view in Freeplay Observability simplifies this process. It allows you to quickly review customer interactions (conversation “turns”) while still providing access to detailed trace information. Enable this view by passing an input_question and output_answer at the start and end of any trace (see docs here).
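If you’re integrating via our Python SDK, the wiring might look something like the sketch below. This is a hypothetical sketch: the client setup and method names (create_trace, record_output) are illustrative assumptions, so check the docs for the exact calls.

```python
# Hypothetical sketch -- the real Freeplay SDK calls may differ; see the docs.
from freeplay import Freeplay  # assumes the Python SDK package

fp_client = Freeplay(freeplay_api_key="YOUR_API_KEY",
                     api_base="https://YOUR_ORG.freeplay.ai/api")

# Start a trace for one chat turn, recording the question the customer asked.
session = fp_client.sessions.create()                             # illustrative
trace = session.create_trace(input="How do I reset my password?")  # the input_question

# ... run your prompts, tool calls, and retrieval steps here,
# logging each LLM completion against this same trace ...

# Close the trace with the final answer the customer actually saw.
trace.record_output(fp_client, "You can reset it under Settings > Security.")  # the output_answer
```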

Reviewing chatbot data just got so much better:

  • Quickly see chat turns to easily understand the context of customer conversations

  • Dig into trace details for each chat turn, from the chat view or the trace view

  • Use Live Filters to set up complex filters and review new conversations based on customer feedback, eval scores, and more

Defining chat prompts with history

Managing state in a chatbot involves keeping track of conversation history. Depending on your prompting strategy, the way you manage history can get complicated. Developers traditionally manage this in code, separate from prompts, but that can limit how easily you iterate on prompt structure, especially if non-developers work on prompts, as many Freeplay customers do.

That’s why we’ve made it easier to define how conversation history should be passed to an LLM as part of a prompt template. Our platform now supports a special class of message in prompt templates called history that accepts an array of prior messages. This makes it easy to experiment with example conversation histories in our playground and testing features.
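For example, formatting a prompt that declares a history message might look roughly like the sketch below. The get_formatted call shape, parameter names, and the “support-chat” template are illustrative assumptions; see the prompt docs for exact usage.

```python
# Hypothetical sketch of formatting a prompt template that includes a `history` message.
conversation_history = [
    {"role": "user", "content": "What's your refund policy?"},
    {"role": "assistant", "content": "We offer full refunds within 30 days of purchase."},
]

formatted = fp_client.prompts.get_formatted(      # illustrative call
    project_id=PROJECT_ID,
    template_name="support-chat",
    environment="latest",
    variables={"customer_name": "Ada"},
    history=conversation_history,  # prior turns slot in where the template's history message sits
)

# The formatted prompt now contains system + history + new user messages,
# ready to send to your model provider.
```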

Managing chat datasets

Once you have a prompt template that uses history, you’ll want compatible datasets so you can start experimenting and testing.

Since Freeplay knows the exact structure of all your prompts and various inputs, it’s easy to save and curate real-world examples into useful datasets for testing in the future.

For instance, if you get negative customer feedback on a specific chat turn due to a hallucination or other problem, you might want to re-test that scenario later to confirm it’s fixed. When you save that example to a dataset, its conversation history is saved automatically, so you can quickly reproduce that exact state in future tests.

Plus: You can edit examples that contain history values directly in Freeplay, enabling you to create more challenging test cases or refine existing ones.
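As a rough mental model, a saved example that carries history might look like the record below. The field names here are illustrative, not the exact dataset schema.

```python
# Illustrative shape of a dataset example that includes conversation history.
example = {
    "inputs": {"customer_name": "Ada"},                   # template variables for this turn
    "history": [                                          # the saved conversation state
        {"role": "user", "content": "Do you ship internationally?"},
        {"role": "assistant", "content": "Yes, we ship to most countries."},
    ],
    "input": "How long does shipping to Germany take?",   # the turn under test
    "output": "Typically 5-7 business days.",             # the answer to compare against
}
```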

Experimentation & testing chat scenarios

The last step in taking advantage of these new features is bringing them all together into a test suite you can trust and rely on, whether you’re running offline experiments or integration tests.

Once you’ve set up prompt templates to use history, started logging traces from your application, and saved good test cases into example datasets, it’s easy to start automating tests from within the Freeplay app or via our SDK. You can configure evals to run for a relevant prompt and automatically score test results, so you instantly see how you’re improving or regressing versus prior versions of your system.
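Kicked off from code, a batch test might look roughly like this sketch. The test_runs helpers, argument names, and the call_your_llm placeholder are assumptions for illustration; the SDK docs show the exact interface.

```python
# Hypothetical sketch: replay every saved conversation turn in a dataset through the
# current prompt version and record results so configured evals can score them.
test_run = fp_client.test_runs.create(        # illustrative call
    project_id=PROJECT_ID,
    testlist="chat-regressions",              # the dataset of saved chat scenarios
)

for case in test_run.get_test_cases():        # illustrative iteration helper
    prompt = fp_client.prompts.get_formatted(
        project_id=PROJECT_ID,
        template_name="support-chat",
        environment="latest",
        variables=case.variables,
        history=case.history,                 # replay the saved conversation state
    )
    answer = call_your_llm(prompt)            # your own model/provider call
    test_run.record_result(case, answer)      # scores roll up against prior versions in Freeplay
```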

This enables both engineers and non-engineers to confidently iterate on prompts, test new models, and experiment with changes to your chatbot.

Getting started

If you’re already using Freeplay, we’ve launched a new guide here in our docs that walks through the details of getting set up. If you’re interested in getting access, please reach out.
