In our latest episode of Deployed, we talk with Kwindla Hultman-Kramer, co-founder of Daily and creator of Pipecat, the most widely used open source framework for voice agent orchestration. Kwin has been building voice products for the web since long before AI was cool, and he literally wrote the book on voice AI development.
What makes this conversation particularly valuable is Kwin's perspective after watching so many builders and companies go through the transition from experimental demos to production-scale voice AI deployments. Sitting at the center of the voice AI ecosystem, he sees where voice AI is creating real business ROI and customer adoption. He's learned plenty of hard lessons about what actually works, and he shares a range of advice about which models and protocols to use, how to think about evals, and what metrics really matter. He also shares his take on which use cases make sense today, and what's on the horizon.
For teams considering an investment in voice AI, or already working to tune a production-grade system, Kwin shares tactical insights that could save months of trial and error.
Check out the full episode here, or read our recap of some highlights below. You can also find the full episode on Spotify, Apple Podcasts, or YouTube.
Why Voice AI Is Having Its Moment In 2025
Kwin sees clear evidence that voice AI has hit an inflection point, with more teams rapidly moving from interesting demos to real production deployments:
"I think we hit the inflection point where first there was a bunch of really concrete interest in voice AI like very late last year, and then starting kind of early this year we're seeing real production scale deployments grow... there's really genuine large time savings and ROI on deploying these things for specific use cases."
His go-to example of how voice AI creates real value: healthcare practices using voice agents to call patients before appointments.
What used to require significant staff time (reminding patients about appointments, answering pre-visit questions, collecting information) can now be automated, with structured data entered into the practice's other systems, while simultaneously improving the patient experience.
The pattern he sees is telling: teams start with low-risk use cases like after-hours calls where the alternative is no service at all. They prove the value (and learn the tech), then expand to more complex workflows. The business case drives adoption, not the technology's novelty.
Beyond healthcare, Kwin highlights two other major categories gaining traction: "answering the phones for small businesses" (restaurant reservations, appointment scheduling, basic customer service), and, on the enterprise side, more complex call-center customer support workflows where voice agents handle initial triage and data collection.
The Critical Role of Voice Agent Evaluation (And Why Most Voice Evals Are Still "Vibes")
One of the most practical insights for builders centers on evaluation. We get lots of questions about how to handle evals for voice agents, and Kwin shares both how to get started and which types of evals really matter.
"I think none of us have enough evals... it's just hard to build multi-turn conversation use case evals even if you're just operating in text mode. So the low-hanging fruit might be take eval tooling, take Freeplay's kind of text-based evals and just hook up voice to that because something is better than nothing."
Kwin's advice is refreshingly practical: start with what you know. Most teams can adapt their existing text evaluation approaches to voice by focusing on voice transcripts. That covers the core functionality of a voice interaction: whether the agent is following instructions, calling the right tools, extracting information correctly, and so on.
Beyond that, lots of teams have an instinct that they should be doing more with the .wav files themselves, evaluating attributes of the audio. But he's candid about the current state of voice-specific evaluations:
"I'll tell you where I think everybody is, without exception as far as I know, which is our voice quality evals are largely vibes. If you can get the text mode evals to a place where they're very very good for your multi-turn conversations, you're probably okay if your voice eval stuff is mostly vibes."
As for what's unique to voice, Kwin recommends teams focus on per-turn latency (critical for maintaining conversation flow) and on catching obvious issues like mispronounced numbers or addresses.
But the sophisticated audio file analysis that teams often think they need? It's not where the highest-impact improvements come from (at least, not yet).
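To make that concrete, here's a minimal sketch of what "text evals plus a latency check" can look like. The turn structure, field names, and 1.5-second threshold are illustrative assumptions, not Freeplay's API or Kwin's numbers: grade the transcript with whatever text-mode grader you already use, and flag turns whose response latency would break conversational flow.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    user_transcript: str
    agent_transcript: str
    latency_ms: float  # time from end of user speech to start of agent audio

def eval_conversation(turns: list[Turn], max_latency_ms: float = 1500.0) -> dict:
    # Voice-specific check: flag any turn where response latency breaks the flow.
    slow_turns = [i for i, t in enumerate(turns) if t.latency_ms > max_latency_ms]

    # Text-mode check: hand the full transcript to whatever grader you already use
    # (an LLM-as-judge prompt, assertions on extracted fields, tool-call checks, etc.).
    transcript = "\n".join(
        f"user: {t.user_transcript}\nagent: {t.agent_transcript}" for t in turns
    )
    return {"transcript_for_grading": transcript, "slow_turn_indices": slow_turns}
```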
Production Best Practices That Can Save You Months
Beyond evals, Kwin shares several other tactical recommendations that come from seeing teams make the same mistakes repeatedly.
Use WebRTC, Not WebSockets, for Real-Time Audio
"WebRTC is like a protocol built from the ground up for real-time streaming audio so it maintains that low latency connection sort of at all costs... WebSockets optimizes for guaranteed delivery of data and doesn't optimize for real time. You won't see the problem when you're testing on your own very good machine and very good network, but in production you'll have 15% of calls that just balloon in latency or have unexpected disconnects."
This is the kind of insight that can save teams weeks of debugging production issues that don't appear in development.
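To illustrate the difference in plumbing, here's a minimal server-side sketch using aiortc, an open source Python WebRTC library. The audio device name is platform-specific, signaling is assumed to happen elsewhere, and in practice an orchestration framework like Pipecat provides WebRTC transports so you don't write this yourself. The point is that the peer connection negotiates a loss-tolerant, low-latency media channel instead of pushing audio frames over a TCP-backed WebSocket.

```python
from aiortc import RTCPeerConnection, RTCSessionDescription
from aiortc.contrib.media import MediaPlayer

async def answer_call(offer_sdp: str, offer_type: str) -> RTCSessionDescription:
    pc = RTCPeerConnection()

    # Send microphone audio as a real-time media track (SRTP over UDP under the
    # hood), so a late packet is dropped instead of delaying everything behind it.
    mic = MediaPlayer("default", format="pulse")  # device/format are platform-specific
    pc.addTrack(mic.audio)

    # Complete the offer/answer exchange (the offer arrives via your signaling channel).
    await pc.setRemoteDescription(RTCSessionDescription(sdp=offer_sdp, type=offer_type))
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return pc.localDescription
```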
Avoid Speech-to-Speech Models in Production (For Now!)
Despite the excitement around OpenAI's Realtime API and similar offerings from Google and others, Kwin recommends a more conservative approach today, in July 2025:
"The speech-to-speech models are the future. But today what you should do in production is a transcription model, a text-mode LLM, and a voice generation model because that's going to be better at function calling, better at instruction following, easier to instrument and debug, and overall much more stable."
His advice: Use speech-to-speech for prototyping and demos, but when you need reliability at scale, the "three-model approach" (speech-to-text → LLM → text-to-speech) gives you better control and more options for fallbacks when individual services have issues.
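As a rough illustration of that cascade, here's a minimal, non-streaming turn loop using OpenAI-hosted models as stand-ins. The model and voice names are assumptions, and a production framework like Pipecat streams each stage and interleaves them rather than running them back to back:

```python
from openai import OpenAI

client = OpenAI()

def run_turn(user_audio_path: str, history: list[dict]) -> tuple[str, bytes]:
    # 1. Speech-to-text: transcribe the caller's latest utterance.
    with open(user_audio_path, "rb") as f:
        user_text = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # 2. Text-mode LLM: append the transcript and generate the agent's reply.
    history.append({"role": "user", "content": user_text})
    completion = client.chat.completions.create(model="gpt-4o", messages=history)
    reply_text = completion.choices[0].message.content
    history.append({"role": "assistant", "content": reply_text})

    # 3. Text-to-speech: synthesize the reply for playback to the caller.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    return reply_text, speech.read()
```

Because each stage is a separate call, you can log transcripts and tool calls per turn, swap providers independently, and fall back when one service degrades, which is exactly the debuggability and stability Kwin is pointing at.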
Design for State Management Early
One of Kwin's last insights is about the importance of conversation state management. Rather than giving the LLM one long prompt and hoping it maintains context across a multi-turn conversation, successful production systems break conversations into states with specific prompts, tools, and exit conditions for each state.
This "state machine" approach helps balance the two key capabilities teams need: open-ended conversation, and reliable instruction following. Each state can be optimized for its specific goals, leading to much higher success rates in complex workflows.
The Voice AI Technical Architecture That Works
To summarize, Kwin breaks down today's "best practice" for a production voice AI stack into four layers that teams need to think about:
1. Models (speech-to-text, LLMs, text-to-speech)
2. APIs (how you access those models)
3. Orchestration (frameworks like Pipecat that handle the complexity)
4. Application code (your specific business logic)
His advice: don't try to build the infrastructure layers yourself. 2024's problems (low latency, turn detection, context management, function calling in real-time contexts) are largely solved in good frameworks. Focus your energy on July 2025's problems, like steering LLMs effectively for your specific use cases (especially with complex multi-turn conversations).
Looking Forward
Kwin predicts that video (voice + real-time avatars) will hit the same inflection point that voice just did within the next year. The technology is nearly out of the uncanny valley, costs are dropping, and early business use cases like training, compliance, and interviews are proving value. Consumer voice and video use cases are still being figured out, but could also surprise us all.
For teams getting started, his advice is simple: start building. Voice is becoming the common denominator UI for generative AI applications, and teams who start experimenting now will help define the interaction patterns that become standard.
The opportunity is significant: we're living through the early days of a platform shift, and voice interaction will be central to how we all work with AI in the future.
Want to dive deeper? Check out Kwin's comprehensive guide at voiceaiandvoiceagents.com and explore the Pipecat framework on GitHub. You can also quickly get started integrating Pipecat with Freeplay for observability and evals by following this guide. Details about managing multimodal data in Freeplay are here.
For more conversations like this, subscribe to Deployed on Spotify, Apple Podcasts, or YouTube.