Adding Voice AI to WebRTC applications presents unique technical challenges and user experience considerations. How do you architect systems that handle real-time audio processing, maintain conversational context, and deliver natural, responsive interactions? And how do you design interfaces that adapt to the dynamic nature of AI-powered communication?
In this episode, two members of the WebRTC.ventures team share insights from real-world projects integrating AI agents into live video environments, and designing interfaces for them.
- Hector Zelaya (WebRTC Developer Advocate) will explore the real-time requirements of routing audio from a WebRTC session to an AI agent instead of a human. He’ll discuss common architectural patterns like the “bot runner” approach (see the sketch after this list), the roles of speech-to-text (STT) and text-to-speech (TTS) services, and best practices for effectively combining these technologies.
- Daniel Phillips (Lead UI Designer) will dive into designing interfaces that adapt to user needs and context in real time, and how that has changed from traditional approaches. He’ll talk about using UX to address latency, the human-in-the-loop philosophy, and essential transparency features that provide context for AI-generated responses.
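For a concrete picture of the “bot runner” pattern Hector describes, here is a minimal TypeScript sketch. The interfaces are hypothetical placeholders rather than any specific vendor’s API: a server-side bot joins the session as just another participant, transcribes incoming audio, generates a reply, and streams synthesized speech back into the call.

```typescript
// Minimal sketch of the "bot runner" pattern. All interfaces below are
// hypothetical placeholders, not a specific vendor's or framework's API.
interface SpeechToText { transcribe(audio: Uint8Array): Promise<string>; }
interface LanguageModel { reply(prompt: string): Promise<string>; }
interface TextToSpeech { synthesize(text: string): Promise<Uint8Array>; }
interface RoomConnection {
  // fires whenever a chunk of participant audio arrives from the WebRTC session
  onAudioChunk(handler: (chunk: Uint8Array) => void): void;
  // publishes synthesized audio back into the session as the bot's "voice"
  publishAudio(chunk: Uint8Array): Promise<void>;
}

// The bot runner joins the call as one more participant and pipes
// audio through STT -> LLM -> TTS.
function runVoiceBot(
  room: RoomConnection,
  stt: SpeechToText,
  llm: LanguageModel,
  tts: TextToSpeech,
): void {
  room.onAudioChunk(async (chunk) => {
    const transcript = await stt.transcribe(chunk); // speech-to-text
    if (!transcript.trim()) return;                 // skip silence / empty turns
    const answer = await llm.reply(transcript);     // generate the agent's response
    const speech = await tts.synthesize(answer);    // text-to-speech
    await room.publishAudio(speech);                // stream it back into the call
  });
}
```

In a real system each stage would stream rather than run request/response, and the bot runner would also need voice activity detection and turn detection, which is exactly where the latency and turn-taking considerations discussed below come in.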
Join us to discover how your organization can take advantage of Voice AI technology today, gaining practical strategies to leverage this emerging capability and stay ahead in an increasingly AI-powered communication landscape.
Bonus Content
- Our regular monthly industry chat with Tsahi Levent-Levi. This month’s topic: Is WebRTC Too Complex? You can also watch this content on our YouTube channel.
Scroll down for key insights and episode highlights.
Watch Episode 103!
Key Insights
⚡ Designing for Voice AI means tackling complexity on two fronts: technical and UX-related. It’s not just about making AI work. It’s about making it feel natural in real time, especially in high-stakes, dynamic environments.
Daniel explains, “There is a lot that needs to be looked at when we integrate AI into our applications.” He demoed an internal project, LiveCart, a WebRTC-based platform for real-time video commerce in which an AI Assistant helps a host sell to an audience of live viewers. Voice AI is valuable, he noted, but it also “introduces a lot of complexity, not only technical, but also UX.” The design has to minimize latency in communicating with the Voice AI Assistant and build in human-in-the-loop controls so the host stays in command of the sales situation, all on top of the standard complexity of building a WebRTC-based application that streams out to a large crowd.
⚡ Some latency is inevitable in real-world Voice AI, but good UX can make it feel seamless. In text-based LLM applications, users of a tool like ChatGPT are used to the AI taking a few seconds to respond, sometimes even 10 or 20 seconds. Delays like that become far more problematic in a Voice AI system, where low latency is essential for seamless communication. Beyond optimizing the technical architecture so responses arrive at a normal conversational pace, it helps to keep users informed: they are less likely to get frustrated if they understand what’s happening. Daniel notes that “adding real-time status labels keeps the experience feeling smooth and trustworthy.”
In the demo Daniel showed, there was no noticeable latency when he talked to the avatar. For production situations, where some interactions will inevitably take longer, the design incorporates a “label at the top of the avatar’s video wrapper that says the state of the AI avatar so … if it’s a complex question, you can see how it’s thinking.” This is a simple but important way to handle the inherent latency of voice bots on more complicated questions.
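As a rough illustration of that status-label idea (the element ID and state names below are assumptions, not LiveCart’s actual implementation), the UI only needs a small piece of state wired to the avatar’s video wrapper:

```typescript
// Surface the assistant's current state above its video so users know
// why there may be a pause. Names here are illustrative only.
type AssistantState = 'listening' | 'thinking' | 'speaking';

const stateLabels: Record<AssistantState, string> = {
  listening: 'Listening…',
  thinking: 'Thinking about your question…',
  speaking: 'Answering…',
};

function setAssistantState(state: AssistantState): void {
  // hypothetical label element at the top of the avatar's video wrapper
  const label = document.getElementById('avatar-status');
  if (label) label.textContent = stateLabels[state];
}
```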
⚡ The technical challenges of Voice AI in WebRTC call for specialized expertise. Turn-taking and managing interruptions in Voice AI are complex problems; even humans sometimes struggle to know when to speak. Hector shares his insights on addressing these challenges: “These are kind of things that without the right expertise, it can be very difficult to manage. And what is that expertise that you need? You need expertise around real-time communications, specifically with WebRTC, and you also need the right expertise to architect your AI application and stack in the right way [for low latency and turn detection].”
Episode Highlights
The human-in-the-loop approach helps improve the user experience.
Human conversations rely on subtle cues to manage turn-taking, but voice AI systems struggle with this, especially due to latency. To address these challenges, Daniel and his team introduced keyboard shortcuts that give human hosts quick control over the AI.
As Daniel explains: “The keyboard shortcuts came from a struggle to interact with the chat because it’s different. It’s when you’re talking to someone, you want them to mute or shut up or stop talking. I don’t know, maybe it’s something that we as humans are able to take cues easier, but when you’re just talking, and also latency is a big factor. When we’re testing these voice applications, it’s hard. They start talking and you want them to stop and then it’s cut on what the AI is trying to say and then you say ‘Oh shut up’ or ‘It’s okay, don’t talk to me anymore’ and then they’re like ‘Okay I won’t talk to you anymore’ and then they’re talking over you without them really knowing so those sort of interactions can be kind of weird so we thought about adding these shortcuts with the keyboard where you can mute, sort of create this pre-created prompts.”
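A simplified sketch of what such shortcuts might look like in the browser (the key bindings and the VoiceAssistant interface are illustrative assumptions, not LiveCart’s code):

```typescript
// Host keyboard shortcuts for keeping a human in the loop:
// Escape interrupts the current answer, Ctrl+M toggles the bot's audio.
// The VoiceAssistant interface and key choices are assumptions for illustration.
interface VoiceAssistant {
  stopSpeaking(): void;           // cancel in-progress TTS playback
  setMuted(muted: boolean): void; // silence or re-enable the bot's audio track
}

function registerHostShortcuts(assistant: VoiceAssistant): void {
  let muted = false;
  document.addEventListener('keydown', (event) => {
    if (event.key === 'Escape') {
      assistant.stopSpeaking();    // "stop talking" without talking over the bot
    } else if (event.ctrlKey && event.key.toLowerCase() === 'm') {
      event.preventDefault();
      muted = !muted;
      assistant.setMuted(muted);   // hand the floor back to the human host
    }
  });
}
```

The pre-created prompts Daniel mentions could be bound to additional keys in the same handler, so the host can redirect the AI without speaking over it.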
Creating realistic AI avatars goes beyond technical synchronization
A major challenge is emotion rendering, ensuring the avatar’s reactions align naturally with the emotional tone of the conversation. Hector explains that avatars with emotion rendering can adapt so that “if you say something sad, the avatar is not going to laugh. So these are usually additional models. There are some avatar services, like Tavus is one of them that they produce their own emotion, perception and rendering models.”
This kind of emotional rendering is especially important in certain use cases like therapy, as Hector pointed out.
Latency is often the primary bottleneck in building voice AI systems
Minimizing the processing between user input and AI response is essential to reduce delays and maintain a natural conversational flow. Hector explains, “I think the first thing to consider is how much processing do we want to add between the time we get the user’s input and the time we produce an output because each processing is going to add up in the latency budget. So the less we do in that regard, that’s going to help us on having a natural conversation flow. That’s one of the first thing to manage. And also, we want to have monitoring and observability on how that behaves over time and try to catch any performance issue that could raise up. So that’s the first thing.”
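One way to act on that advice is to make the latency budget visible per conversational turn. The sketch below times each step of a hypothetical STT -> LLM -> TTS pipeline so regressions show up in monitoring; the stage names and function signatures are assumptions, not a real observability API.

```typescript
// Time each stage of a turn so the latency budget is visible in monitoring.
// The pipeline functions passed in are hypothetical; wire up your own services.
async function timed<T>(
  stage: string,
  timings: Record<string, number>,
  fn: () => Promise<T>,
): Promise<T> {
  const start = performance.now();
  const result = await fn();
  timings[stage] = performance.now() - start; // milliseconds spent in this stage
  return result;
}

async function handleTurn(
  audio: Uint8Array,
  transcribe: (a: Uint8Array) => Promise<string>,
  generate: (t: string) => Promise<string>,
  synthesize: (t: string) => Promise<Uint8Array>,
  publish: (a: Uint8Array) => Promise<void>,
): Promise<void> {
  const timings: Record<string, number> = {};
  const text = await timed('stt', timings, () => transcribe(audio));
  const reply = await timed('llm', timings, () => generate(text));
  const speech = await timed('tts', timings, () => synthesize(reply));
  await publish(speech);
  console.log('latency budget (ms):', timings); // export to your metrics backend instead
}
```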
Up Next! WebRTC Live Episode 104:
Why Vision Language Models Deserve a Closer Look
Wednesday, July 16 at 12:30 pm EDT