Tool Management, Experimentation & Evaluations: Build Better AI Agents with Freeplay

Dec 12, 2024

As AI products mature beyond chat interfaces, development teams are increasingly building systems that can take action on a user’s behalf. These agentic AI systems use "tools" (aka "function calls") to interact with APIs, retrieve information, and perform tasks. Developing and maintaining these tool-enabled systems brings new challenges — especially when trying to coordinate across technical and non-technical team members. 

That’s why we’ve made some big updates that make it much easier to manage, record, and experiment with tools in your prompts and agentic systems using Freeplay. This post covers the highlights.

Alongside prompts, tools are a core primitive in the generative AI product development landscape — and they deserve first-class support. We’ve long supported tool use, but schema definitions had to live in code outside of Freeplay. Our customers told us they wanted more flexibility to iterate within Freeplay, without giving up control in code.

With our latest changes it’s now easy to:

  • Iterate on tool schemas, either in your code or in the Freeplay app — whichever workflow you prefer

  • Test tool behavior in the Freeplay playground as well as in code, so that even non-engineers can easily make prompt changes or update tool descriptions

  • Swap your prompts between model providers without modifying tool schemas (e.g. when swapping prompts between OpenAI, Anthropic or Google)

  • Run offline experiments and tests that involve tools, including support for structured tool calls in datasets managed with Freeplay

  • Run auto-evals that target tool schemas as part of your evaluation checks, to confirm that tool selection and other behavior match your expectations

Below is a quick demo video of what the new changes look like. Read on for more detail. 

Background

If you’re new to some of the challenges with using and managing tools, it might help to start with a quick primer. 

Generative AI systems use tools to interact with APIs, retrieve information, and perform tasks, and tools have become an essential part of building LLM-powered systems that work with other services. A decent primer is here.
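
If you’re brand new to tools: a tool is described to the model as a JSON schema with a name, a description, and typed parameters. Here’s a minimal example in the OpenAI function-calling format (the get_weather tool is a hypothetical stand-in):

```python
# A minimal tool definition in the OpenAI function-calling format.
# "get_weather" is a hypothetical example, not a real API.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Boulder'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```

The model reads the name and description to decide when (and how) to call the tool, which is why small wording changes to a tool description can meaningfully change behavior.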

However, developing and maintaining these tool-enabled systems brings new challenges, especially when trying to coordinate across technical and non-technical team members:

  • Tool schemas can be opaque and inaccessible when locked up in source code. But with Freeplay customers, it’s often non-engineers who want to experiment with prompt changes that include tool calls, or even adjust tool definitions – especially since those definitions control which tools a model will choose. That workflow deserves better support.

  • Different model providers (e.g. OpenAI, Anthropic) handle tools differently and require different schemas and call structures, which makes it a pain to quickly swap models (see the comparison sketch after this list).

  • Running prompt engineering experiments and testing AI systems that involve tool calling requires knowledge of those tools, including how to handle them appropriately in datasets used for testing, evals, etc. Getting a whole LLM ops system working well with tools takes work.
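
To make that second point concrete, here’s the same hypothetical get_weather tool expressed for OpenAI and Anthropic. The underlying JSON Schema is identical, but the wrapper structure differs:

```python
# Shared JSON Schema for the hypothetical get_weather tool's parameters.
weather_params = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

# OpenAI nests the schema under "function" and calls it "parameters"...
openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": weather_params,
    },
}

# ...while Anthropic keeps it top-level and calls it "input_schema".
anthropic_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": weather_params,
}
```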

Our goal at Freeplay is to help AI teams build great AI products, and that means solving these problems in ways that allow engineers and non-engineering team members like PMs and data analysts to collaborate effectively. Ideally both groups are empowered, without stepping on each other’s toes.

That’s why we’re excited to release these new features that make tool-enabled AI development more accessible, transparent, and reliable for the entire product development organization — not just engineers. Let’s dive into the details of what’s new.

What's New: A Better Toolkit for AI Product Development

We've made tools a first-class citizen in Freeplay, integrating them deeply with our prompt management system and other features. 

Tool schemas can now be managed and versioned alongside your prompts and model configurations, with the same versioning and feature flag-style deployment options Freeplay has long offered for prompts. This allows for easy collaboration between engineers, product owners, and whoever is doing prompt engineering or running experiments (even if they’re not engineers).

  • Version tracking handles both tool schemas and prompts together

  • Choice of how you prefer to create and manage tool schemas:

    1. Start in our playground, then retrieve via the Freeplay SDK just like prompt templates (a sketch of this workflow follows below), or

    2. Record from your code, then save the tool schema in Freeplay for future experimentation in the UI.

Already using tools but not using Freeplay yet? You can easily push your tools into the Freeplay system; our SDK works with your existing code.
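
To give a feel for the first workflow, here’s a rough sketch of retrieving a prompt and its tool schema via the SDK and passing both through to a provider call. The method and field names here (get_formatted, tool_schema, the weather_agent template) are illustrative assumptions – check the SDK docs for exact signatures:

```python
# Rough sketch: fetch a versioned prompt plus its tool schema from Freeplay,
# then call the provider. Names like get_formatted, tool_schema, and the
# "weather_agent" template are illustrative assumptions, not exact APIs.
from freeplay import Freeplay
from openai import OpenAI

fp = Freeplay(freeplay_api_key="YOUR_FREEPLAY_KEY",
              api_base="https://app.freeplay.ai/api")
openai_client = OpenAI()

# Pull the prompt version deployed to an environment, variables filled in.
prompt = fp.prompts.get_formatted(
    project_id="YOUR_PROJECT_ID",
    template_name="weather_agent",      # hypothetical template name
    environment="production",
    variables={"user_question": "Will it snow in Boulder tomorrow?"},
)

# Messages and the tool schema come back formatted for the target provider.
response = openai_client.chat.completions.create(
    model=prompt.prompt_info.model,
    messages=prompt.llm_prompt,
    tools=prompt.tool_schema,           # assumed field name for the tools
)
```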

Integrated Playground Experimentation

However you choose to create and manage tool schemas initially, the Freeplay playground now fully supports experimenting with prompt and model changes, as well as changes to tool descriptions and other settings that help you tune model behavior.

  • Rapidly iterate on tool descriptions to see how they’re handled by different models

  • Make side-by-side comparisons of tool behavior across OpenAI, Anthropic, Google and other providers (including Azure OpenAI and AWS Bedrock)

Observability For Your Whole Team

We’ve redesigned how tools appear in our Observability features to better serve both technical and non-technical team members, so that anyone can easily see what's happened in an agentic workflow. The redesigned view of tool interactions makes it easier to interpret what tools are doing without having to read through complex stack traces, while still maintaining detailed raw data for engineers to debug. You can easily see what your whole system is doing – even the parts that don’t call an LLM.

This better enables:

  • Collaborative debugging of tools-enabled systems

  • Full labeling and dataset curation support for reviewers

  • More intuitive live monitoring and evaluation of agentic systems in production

Comprehensive Testing & Evaluation Support

We’ve expanded our evaluation and testing flows to better support tool-enabled systems too.

  • Freeplay’s auto-evals now support evaluation of specific tool-interaction components, to ensure your LLM systems invoke the right tools for the job (illustrated after this list)

  • Freeplay datasets support tool schemas natively, so it’s easy to collect and maintain datasets for testing tool-enabled features

  • Quickly run batch tests to validate and quantify tool behavior — either through the Freeplay UI or our SDK
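
As a simple illustration of the kind of check a tool-targeted eval can make, here’s a plain-Python sketch (not Freeplay’s eval syntax) that verifies a response called the expected tool with arguments that actually parse:

```python
import json

# Plain-Python sketch of a tool-selection check; not Freeplay's eval syntax.
def called_expected_tool(response_message, expected_tool: str) -> bool:
    """True if the model called expected_tool with arguments that parse as JSON."""
    for call in getattr(response_message, "tool_calls", None) or []:
        if call.function.name == expected_tool:
            try:
                # OpenAI-style responses return arguments as a JSON string.
                json.loads(call.function.arguments)
                return True
            except json.JSONDecodeError:
                return False
    return False
```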

Provider Flexibility

We’ve made updates across our SDK to make it simple to work with tools across LLM providers. When you manage your tool schemas with Freeplay, the SDK handles schema translation across providers any time you choose to swap providers. We’ve done this since the beginning with prompts, and now provide the same flexibility for tool use.
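
That translation is largely mechanical. A simplified version of the idea (not Freeplay’s actual implementation – the SDK does this for you) looks like this:

```python
# Simplified illustration of OpenAI-to-Anthropic tool-schema translation.
# Not Freeplay's actual implementation; the SDK handles this for you.
def openai_tool_to_anthropic(tool: dict) -> dict:
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }
```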

Why This Matters

Tool use, and agentic systems more broadly, are becoming the norm for AI product teams.

But with that shift, the complexity of building and maintaining these systems increases significantly. Existing approaches create silos between engineers and other team members, slowing down iteration and making it harder to improve a product.

Freeplay’s approach to tool support aims to break down these barriers. By making tool-enabled AI products more accessible and observable, entire product development teams can collaborate effectively on building and improving AI agents.

Getting Started

If you're already using Freeplay, these features are all live now. An updated intro guide is here, and integration details are in the SDK docs and API docs.

If you're new to Freeplay, we’d love to get you started. Grab time with our team today.
