How to Test and Optimize Multimodal AI Workflows with Freeplay

Discover a better way to improve your voice agents and multimodal products. New tools provide end-to-end coverage for multimodal evals, observability, and experimentation.

Builders need better tooling to create and optimize their multimodal AI projects. If you really want to understand how a voice agent or image analysis app is performing, you don't just want to capture the relevant files. You need to instrument each part of the system to create a faster loop for iteration and error detection.

That's why we've built end-to-end support for optimizing multimodal AI products, helping teams complete the full build/test/learn/optimize loop for their multimodal applications. Take data straight from production logs, turn it into datasets for batch testing, experiment with original files in our playground as you make changes to your prompts and models, and run relevant evals with every iteration.

This post walks through those new features for multimodal app evaluation, experimentation, and observability — as well as best practices for how they're being used to more quickly iterate on multimodal systems and improve quality.

You can check out the demo below, and read on to learn how teams are using Freeplay to ship better multimodal apps, faster.

What's the problem?

AI teams are moving fast on multimodal applications — systems that pull structured data out of PDFs, analyze images, understand and respond with voice, and more. While model capabilities have advanced, evaluation tasks and workflows have become more complex with multimodal systems. And those workflows are even more important to get right as these systems move into production.

Voice transcripts miss nuance. Text-to-speech systems can say the wrong things. Data gets missed in OCR use cases, or images get misinterpreted. And without structured evaluation, there’s no way to know what’s working and what isn’t.

As Pipecat creator Kwindla Hultman-Kramer put it on our Deployed podcast when talking about voice agents:

“None of us have enough evals… Most voice evals today are still just vibes.”

Kwin captured what we’re hearing from teams across the board, and across modalities.

Whether you're testing to improve instruction-following, tool usage, or content extraction, you need a way to quickly validate model outputs across a range of real-world inputs and files (PDFs, PNGs, MP3s, etc.). You need to be able to quickly go from finding errors to running experiments to fix them. And once you deploy a change, you need to be able to detect any new surprises in production.

Building production-grade workflows that actually work requires instrumentation for each of those steps in the process.

What’s New in Freeplay Multimodal

We’ve extended Freeplay’s evaluation, experimentation, and observability capabilities to fully support audio, images, and documents. That means you can now track, test, and troubleshoot multimodal apps just like any other LLM workflow, across any supported models.

Multimodal Prompt Engineering Support

  • Add media inputs like voice_recording, product_image, or uploaded_document as variables in your prompt templates (see the sketch after this list)

  • Upload and test media files directly in the Freeplay playground

  • Run batch tests with all your evals when you're ready to pressure test new changes

  • Works with OpenAI, Claude, Gemini, and any model that supports multimodal input
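
To make the prompt-template idea concrete, here's a minimal sketch of what a resolved multimodal prompt can look like at the API level: a local image is base64-encoded and passed alongside text in a single OpenAI chat completion. The file name and the product_image variable are illustrative, and the Freeplay-specific steps (fetching the template, recording the completion) are omitted; see the multimodal guide for the exact SDK calls.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Read a local image and encode it as base64.
# "product_image" mirrors the media variable name used in the prompt template above.
with open("product_photo.png", "rb") as f:  # illustrative file name
    product_image = base64.b64encode(f.read()).decode("utf-8")

# A multimodal prompt resolves into a message that mixes text and media parts.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Categorize this product photo and describe any visible defects."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{product_image}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```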

Media-Aware Observability

  • Record and view original audio, image, or PDF inputs along with the rest of your LLM completion details

  • Trace behavior through agent pipelines with full context

  • Use Freeplay to run your own custom evals on production logs to automatically catch regressions or issues

  • Save useful examples from your logs directly to datasets for future testing or fine-tuning

Eval Integration for Mixed Inputs

  • Run evals over voice conversation transcripts or image analysis outputs tied to multimodal prompts (a sketch of one such check follows this list)

  • Define success criteria like instruction-following, tone, factual accuracy, or format adherence

  • Use the same evals for both offline testing and experimentation, and to evaluate production logs
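
As an illustration of the success criteria listed above, here's a minimal LLM-as-judge sketch that grades a voice-agent transcript for instruction-following and tone. The rubric, function name, and use of an OpenAI model as the judge are assumptions for the example, not Freeplay's built-in eval mechanism; the same criteria can be configured as evals in Freeplay and reused for offline tests, experiments, and production logs.

```python
import json
from openai import OpenAI

client = OpenAI()

def judge_transcript(transcript: str, instructions: str) -> dict:
    """Illustrative LLM-as-judge eval: scores a voice-agent transcript
    against criteria like instruction-following and tone."""
    rubric = (
        "You are grading a voice agent transcript.\n"
        f"Agent instructions:\n{instructions}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Return JSON with boolean fields followed_instructions and appropriate_tone, "
        "plus a short string field notes."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": rubric}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example: run the same eval offline against a saved transcript,
# or against transcripts pulled from production logs.
result = judge_transcript(
    transcript="Agent: Hi, I can help you reschedule... Caller: Great, Tuesday works.",
    instructions="Greet the caller, confirm identity, then offer available times.",
)
print(result)
```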

Developer SDK Support

  • Use Freeplay’s SDKs to pass and manage media files (base64 or URL references); a sketch of both forms follows this list

  • No orchestration lock-in: you control the code you want to use

  • Easily integrate with your existing agents, API endpoints, or custom pipelines
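
Here's a small sketch of the two media reference styles mentioned above: base64 payloads for local files and URL references for hosted ones. The helper names and dictionary shapes are illustrative; the exact parameter names the Freeplay SDKs expect are covered in the multimodal guide.

```python
import base64
import mimetypes
from pathlib import Path

def media_as_base64(path: str) -> dict:
    """Inline a local file (PDF, PNG, MP3, ...) as a base64 payload."""
    mime, _ = mimetypes.guess_type(path)
    data = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    # Dictionary shape is illustrative, not the SDK's exact schema.
    return {"type": "base64", "media_type": mime or "application/octet-stream", "data": data}

def media_as_url(url: str) -> dict:
    """Reference media that's already hosted (e.g. in your own object storage)."""
    return {"type": "url", "url": url}

# Either form can be passed as the value of a media variable;
# file names and URLs here are placeholders.
voice_recording = media_as_base64("call_2024_06_01.mp3")
uploaded_document = media_as_url("https://example.com/files/invoice-123.pdf")
```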

Check out our new multimodal guide in the Freeplay docs for all the details.

Best Practices for Multimodal AI Systems

Building great production-grade multimodal apps depends on how you structure your workflows, evaluate performance, and iterate as new edge cases emerge. One of the most valuable things you can establish on the ops side is a fast loop for iterating with confidence.

We've seen a few patterns that consistently lead to better results:

  1. Break your system into clear states.
    Whether you're guiding a conversation, parsing a PDF, or interpreting an image, structure your application as a series of discrete stages or “states.” Each step should have a defined prompt template, specific inputs (text and/or media), and a clear exit condition. Successful production systems break complex tasks into such states with targeted prompts and tools. This modular, state-machine approach makes debugging easier and scaling more predictable, since each state can be optimized and tested in isolation.

  2. Start by measuring what matters. In a multimodal app, there are many things you could measure, but start with the outputs that directly affect your users and that you can evaluate consistently. Focus on whether the agent followed instructions, called the right tools, extracted the correct field from a PDF, or categorized an image properly. These are objective criteria you can verify (often via text outputs or simple checks). Text-based outputs – even those originating from audio or image inputs – are usually the easiest place to begin. Turn latency also matters a ton for audio. By establishing a baseline for these crucial metrics, you get immediate signal on quality. Once you’re confident in those, you can gradually add more nuanced evals as you discover other issues during testing.

  3. Keep evaluations actionable (focus on content over form). It’s easy to get sidetracked thinking about evals for the raw media aspects of your system (audio waveforms, image pixels, etc.), but a ton of impactful improvements can come from tried and true text evals and human-in-the-loop reviews. In practice, this means using transcripts and extracted text as your anchor for evaluation. For example, if you’re building a voice agent, focus your automated evals on the speech-to-text transcript and what the AI said in response. Human reviewers can test out the audio attributes and review examples manually to get a sense for whether things are sounding right (part of why it’s nice to also have the audio file logged for reference). Whatever you do, try to shape your evals toward actionable signals.

  4. Instrument and monitor each step. Make observability a first-class citizen in your multimodal pipeline early. That means logging and saving the inputs and outputs at every stage — even in dev. By instrumenting each part of your application, you can quickly pinpoint where things go wrong (e.g. a misrecognized word in transcription or a faulty image classification) and measure performance of each component. Robust pipelines use evals on logs for automated monitoring, and regularly reviewing errors by hand will help you catch and fix issues early.

  5. Iterate regularly on real examples. Multimodal models are evolving quickly, and they often behave unpredictably on edge cases. The best teams run a continuous optimization cycle that incorporates real-world data into their development process. Instead of guessing how a change might play out, they continuously test with actual observed examples from production (or close to it, when PII/PHI/etc. needs to be redacted). For every new prompt tweak or model update, run a batch of diverse real inputs – actual PDFs, images, audio snippets – through your workflow to “back test” your system offline, before you ship to customers. This helps catch regressions or weird outputs immediately, rather than weeks later. Over time, this habit of rapid, real-data experimentation lets you “hill-climb” your way to a better product: you make incremental improvements, guided by feedback from real files and transcripts, and quickly converge on what works best for your customers. A minimal sketch of this back-testing loop follows this list.
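
Here's a minimal sketch of the back-testing loop described in points 2 and 5: run a batch of saved real-world examples through your workflow and apply simple objective checks before shipping a change. run_workflow and the dataset layout are hypothetical stand-ins for your own pipeline and data.

```python
import json
from pathlib import Path

def run_workflow(pdf_path: Path) -> dict:
    """Stand-in for your own extraction step (prompt + model + parsing).
    Replace this with a call into your application."""
    return {}

def back_test(dataset_dir: str) -> None:
    """Run every saved real-world example through the workflow and apply
    simple objective checks (did we extract the right field?) before shipping."""
    passed, failed = 0, []
    for case_file in Path(dataset_dir).glob("*.json"):
        # Each case file pairs an input file with an expected value,
        # e.g. {"pdf": "invoices/acme.pdf", "expected_total": "$1,240.00"}.
        case = json.loads(case_file.read_text())
        output = run_workflow(Path(case["pdf"]))
        if output.get("invoice_total") == case["expected_total"]:
            passed += 1
        else:
            failed.append((case_file.name, output.get("invoice_total"), case["expected_total"]))
    print(f"{passed} passed, {len(failed)} failed")
    for name, got, want in failed:
        print(f"  {name}: got {got!r}, expected {want!r}")

if __name__ == "__main__":
    back_test("datasets/invoice_extraction")  # illustrative dataset path
```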

Get Started with Multimodal in Freeplay

Whether you’re building an AI voice agent for sales automation, or combining images and document context for evaluating support tickets, Freeplay gives you the end-to-end tools for evals, observability, and iteration speed to ship with confidence.

Explore general multimodal support here:
👉 Working with Multimodal Data in Freeplay

If you're thinking about building a voice agent, try starting with this sample app:
👉 Build Voice-Enabled AI Apps with Pipecat, Twilio, and Freeplay
