We built a Live Sales AI Presenter to model what a real-time AI avatar system looks like when it operates inside a real workflow rather than a demo sandbox. It is a slide-aware AI sales presenter combining deck ingestion, presentation control, live Q&A, WebRTC media, Pipecat orchestration, OpenAI Realtime, and HeyGen live avatar. You can watch the walkthrough here and the code is here.
What the prototype makes visible is that a real-time AI avatar system is not one thing. It is six distinct layers, each with its own failure modes, latency profile, vendor considerations, and production requirements. The avatar is the visible layer. The system behind it (state management, orchestration, media transport, observability, compliance, and escalation) is where the Voice AI engineering work lives.
In a previous post, Don’t Mistake the AI Avatar for the Voice AI System Behind It, we covered what to evaluate when selecting an AI avatar platform. This post is about what you are actually building once that decision is made. It walks through what we built, why each layer exists, and what a production deployment would add to each one. The goal is not to show a finished system but to show why Voice AI systems are more complex than they appear, and where that complexity concentrates.

The prototype: a slide-aware AI sales presenter
The Live Sales AI Presenter is a focused use case: an AI avatar that presents a slide deck, handles live questions from the audience, and navigates the presentation in response to both the user and the model. It is the kind of workflow-specific application that enterprises are actually likely to build or buy: not a general-purpose chatbot with an avatar face, but an agent operating inside a defined context with defined boundaries.
Getting that to work requires six distinct layers operating together in real time.
Layer 1: Deck ingestion and slide state (FastAPI backend)
Before the avatar can present anything, the system needs to know what is in the deck. That means ingesting the slides, extracting content, indexing it for semantic search, and making it available to the model as grounded context during the session.
The FastAPI backend owns this. It also owns session state — which slide is active, what has been said, what tools the agent has called, and what the user has asked. This is deliberate. The model can request a slide transition or search the deck for an answer, but it does not mutate state directly. The application validates the request, executes it, and records it.
When the realtime model decides to call a slide tool, the Pipecat layer intercepts the request and delegates the actual state change to the backend:
def dispatch_tool_call(session_id, tool_name, args):
    # Each supported tool is forwarded to the backend, which owns the state change.
    if tool_name == "next_slide":
        return api.next_slide(session_id)
    if tool_name == "goto_slide":
        return api.goto_slide(session_id, args["slide_index"])
    if tool_name == "search_slides":
        return api.search_slides(session_id, args["query"])
    # Anything the application does not recognize is rejected, not guessed at.
    raise ValueError(f"Unsupported tool: {tool_name}")
The model requests an action. The application validates, executes, and records it.
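To make the validate, execute, record sequence concrete, here is a minimal sketch of what the backend side of a slide change might look like. The `SessionState` class, its fields, and the shape of the returned dictionaries are assumptions made for illustration, not the prototype's actual code, and a real backend would persist this rather than hold it in memory.

```python
from dataclasses import dataclass, field
from time import time


@dataclass
class SessionState:
    """Hypothetical in-memory session record (illustrative only)."""
    slide_count: int
    current_slide: int = 0
    events: list = field(default_factory=list)  # append-only audit trail

    def goto_slide(self, slide_index: int) -> dict:
        # Validate: reject anything that is not a real slide in this deck.
        if not isinstance(slide_index, int) or not 0 <= slide_index < self.slide_count:
            self._record("goto_slide", {"slide_index": slide_index}, ok=False)
            return {"ok": False, "error": "slide_index out of range"}
        # Execute: only application code mutates the state.
        self.current_slide = slide_index
        # Record: keep what was requested and what happened.
        self._record("goto_slide", {"slide_index": slide_index}, ok=True)
        return {"ok": True, "current_slide": self.current_slide}

    def _record(self, tool: str, args: dict, ok: bool) -> None:
        self.events.append({"ts": time(), "tool": tool, "args": args, "ok": ok})
```

Because rejected requests are recorded as well, the event list shows both what the model asked for and what the application actually allowed.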
In production this layer needs managed database infrastructure, object storage for deck assets, session expiration, access control, and audit logging. It also needs to survive failure: page refreshes, dropped connections, backend restarts, and multi-user edge cases all need explicit handling.
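The deck search behind the `search_slides` tool can be illustrated with a small stand-in. This sketch ranks slides by cosine similarity over plain word counts, which is keyword matching rather than true semantic search. A real deployment would more likely use embeddings and a vector index. The function names and the sample slide text are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search_slides(slides: list[str], query: str, top_k: int = 3) -> list[tuple[int, float]]:
    """Return (slide_index, score) pairs, best match first, skipping zero scores."""
    q = tokenize(query)
    scored = [(i, cosine(tokenize(text), q)) for i, text in enumerate(slides)]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical deck content, one string per slide.
slides = [
    "Pricing tiers and annual discounts",
    "Security and compliance overview",
    "Customer case studies",
]
matches = search_slides(slides, "what is the pricing for annual plans")
```

Returning slide indices rather than free text keeps the answer grounded: the model can cite or jump to a specific slide instead of paraphrasing from memory.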
Layer 2: Real-time voice interaction (OpenAI Realtime)
The avatar needs to hear the user and respond in real time. OpenAI Realtime handles the live voice interaction: speech-to-text, language model inference, and text-to-speech in a single low-latency WebRTC pipeline. The integration points that matter here are interruption handling and tool calls.
In production, teams need to select the right region and network routing, and measure end-to-end latency across the full path. Voice detection, orchestration overhead, avatar rendering, and WebRTC transport all contribute to what the user actually experiences.
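One way to reason about end-to-end latency is to record a timing per stage for every user turn and look at the distribution of the totals, not just the average. The sketch below assumes hypothetical stage names and made-up millisecond values. It is not output from the prototype.

```python
from statistics import mean, median, quantiles


def summarize(turns: list[dict[str, float]]) -> dict:
    """Summarize per-turn latency; each turn maps stage name -> milliseconds."""
    totals = [sum(turn.values()) for turn in turns]
    stage_means = {stage: mean(t[stage] for t in turns) for stage in turns[0]}
    return {
        "p50_ms": median(totals),
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        "p95_ms": quantiles(totals, n=20, method="inclusive")[-1],
        "slowest_stage": max(stage_means, key=stage_means.get),
    }


# Illustrative numbers only; real values must be measured on the target network.
turns = [
    {"voice_detection": 200, "model": 400, "avatar": 300, "transport": 80},
    {"voice_detection": 220, "model": 500, "avatar": 310, "transport": 90},
    {"voice_detection": 210, "model": 450, "avatar": 290, "transport": 100},
]
report = summarize(turns)
```

Tracking the slowest stage alongside the percentiles shows where tuning effort is likely to pay off first.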
Layer 3: Orchestration (Pipecat)
Pipecat sits between the voice pipeline and the application. It manages the real-time audio streams, routes tool calls to the backend, handles interruptions, and coordinates the sequencing of model output and avatar rendering.
This is the layer most teams underestimate. Without an orchestration layer, the logic for managing concurrent audio, tool dispatch, interruption recovery, and session state ends up scattered across the application in ways that become difficult to test, debug, or maintain. Pipecat provides a structured way to handle that complexity, and because it is open source, it avoids vendor lock-in at the layer where business logic lives.
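The interruption problem is easier to see in isolation. The sketch below is not Pipecat's API. It is a plain asyncio illustration of the underlying pattern, with hypothetical function names: the response plays until it finishes or the user barges in, and whatever is still in flight is cancelled.

```python
import asyncio


async def speak(chunks: list[str], played: list[str]) -> None:
    """Stand-in for streaming audio to the avatar; one short sleep per chunk."""
    for chunk in chunks:
        await asyncio.sleep(0.05)
        played.append(chunk)


async def run_turn(chunks: list[str], interrupt: asyncio.Event) -> list[str]:
    """Play the response until it completes or the user interrupts."""
    played: list[str] = []
    speaking = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(interrupt.wait())
    await asyncio.wait({speaking, barge_in}, return_when=asyncio.FIRST_COMPLETED)
    # Cancel whichever task is still running (playback or the interrupt waiter).
    for task in (speaking, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(speaking, barge_in, return_exceptions=True)
    return played  # the part of the response that was actually delivered
```

Returning the delivered chunks matters because the conversation history should reflect what the user heard, not what the model intended to say.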
In production, the orchestration layer also needs retry logic, error recovery, structured logging, geographic distribution, and the ability to reconstruct session events for debugging and compliance purposes.
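As a rough illustration of retry logic combined with structured logging, the sketch below retries a backend call with exponential backoff and jitter, and emits one JSON log line per failed attempt. The helper name, the logger name, and the choice of retryable exceptions are assumptions, not part of the prototype.

```python
import asyncio
import json
import logging
import random

log = logging.getLogger("orchestrator")  # hypothetical logger name


async def call_with_retry(fn, *, attempts: int = 3, base_delay: float = 0.2):
    """Await fn() up to `attempts` times, backing off exponentially with jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return await fn()
        except (ConnectionError, TimeoutError) as exc:
            # One JSON object per line keeps the session timeline machine-readable.
            log.warning(json.dumps({"event": "tool_call_failed",
                                    "attempt": attempt,
                                    "error": type(exc).__name__}))
            if attempt == attempts:
                raise  # out of retries: surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

In a voice session the retry budget has to stay small, since every retry adds silence the user can hear. That is why the defaults here are deliberately short.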
Layer 4: Avatar rendering (HeyGen)
HeyGen takes the audio output from the voice pipeline and renders the talking-head video in real time. From the system’s perspective it is a rendering service: it receives audio and returns a video stream synchronized to that audio.
The integration considerations are latency and synchronization. Any perceptible delay between the audio and the avatar’s lip movement breaks the experience.
In production, avatar rendering also raises questions about fallback behavior when the rendering service is unavailable. These are decisions that need to be made before launch, not after.
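One possible fallback policy is to degrade to audio-only when the rendering service fails or takes too long to connect. The sketch below assumes a hypothetical `connect_avatar` coroutine supplied by the caller and an arbitrary three-second timeout. It does not use HeyGen's SDK.

```python
import asyncio


async def start_avatar_or_fallback(connect_avatar, timeout_s: float = 3.0) -> dict:
    """Try to start avatar video; fall back to audio-only if it fails or is slow."""
    try:
        stream = await asyncio.wait_for(connect_avatar(), timeout=timeout_s)
        return {"mode": "avatar", "stream": stream}
    except (asyncio.TimeoutError, ConnectionError) as exc:
        # The session continues without video; the reason is kept for monitoring.
        return {"mode": "audio_only", "stream": None, "reason": type(exc).__name__}
```

Whether audio-only is acceptable, or the session should end instead, is a product decision. The point of the sketch is that the choice is explicit in code rather than left to whatever happens when the call hangs.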
Layer 5: Media transport (WebRTC)
WebRTC is the right transport choice for real-time conversational AI. It provides the low-latency, bidirectional media path that live interaction requires.
In the prototype, WebRTC handles the media path between the browser and the backend using the Pipecat SmallWebRTC plugin, built on the Python aiortc library. At single-session scale this is manageable. Production WebRTC is a separate engineering problem: TURN server infrastructure, ICE behavior under restrictive firewalls, high bandwidth requirements (especially for video), bitrate adaptation, and quality monitoring across real network conditions.
The peer-to-peer path that works in a demo might not be the architecture that works in production. Teams can self-host the WebRTC infrastructure or use a managed provider such as Daily. Either way, WebRTC at scale requires dedicated expertise in connectivity, media quality, observability, and failure recovery.
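As a small illustration of the TURN and firewall point, the sketch below builds an ICE server list in the dictionary shape that a browser `RTCPeerConnection` configuration expects. The host name, ports, and credentials are placeholders, and production systems typically issue short-lived TURN credentials per session rather than static ones.

```python
def build_ice_servers(turn_host: str, username: str, credential: str) -> list[dict]:
    """Return ICE server entries, ordered from cheapest path to last resort."""
    return [
        # STUN lets the client discover its public address for direct paths.
        {"urls": [f"stun:{turn_host}:3478"]},
        # TURN over UDP relays media when a direct path cannot be established.
        {"urls": [f"turn:{turn_host}:3478?transport=udp"],
         "username": username, "credential": credential},
        # TURN over TLS on port 443 is a common option behind restrictive firewalls.
        {"urls": [f"turns:{turn_host}:443?transport=tcp"],
         "username": username, "credential": credential},
    ]


# Placeholder values for illustration only.
ice_servers = build_ice_servers("turn.example.com", "session-user", "session-secret")
```

Relayed sessions consume server bandwidth for every media packet, which is one reason TURN capacity planning belongs in the production budget from the start.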
Layer 6: The frontend (Next.js)
The Next.js frontend handles the user experience: slide display, avatar video, audio capture, and session controls. It is also where the WebRTC connection originates and where real-time session events surface to the user.
In production the frontend needs to handle connection recovery gracefully, manage browser permissions reliably across devices and operating systems, and give users meaningful feedback when something goes wrong. Browser or app performance under simultaneous WebRTC media and avatar video rendering needs to be tested across target devices before launch.
What the AI Avatar prototype is missing
The prototype demonstrates the architecture. It does not yet have several things a production deployment would require.
- Evals. Answer accuracy, grounding, interruption handling, tool-call correctness, and recovery behavior all need test coverage before this system goes in front of real users. For real-time systems, evals also need to cover timing and workflow behavior, not just whether the model produces the right text.
- Tenant isolation and access control. In an enterprise deployment, every session, transcript, tool call, and generated asset needs a clear ownership model with enforced access boundaries. This is a compliance requirement, not an engineering preference.
- Human escalation. Every production voice AI system needs a defined handoff point with context: session history, transcript, current state, and attempted answers. A human agent receiving only “the user needs help” cannot pick up the conversation effectively.
- Compliance controls. PII redaction, retention policies, audit trails, and access logs are out of scope for a prototype and prerequisites for any regulated industry deployment.
- Managed infrastructure. SQLite and local storage are appropriate for development. Production requires managed database infrastructure, object storage, session expiration, retry logic, reconnection handling, and media cleanup.
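To illustrate the evals point, here is a minimal sketch of a test case that checks tool-call correctness and timing together. The class names, fields, and thresholds are hypothetical. A real harness would also score grounding, interruption handling, and recovery behavior.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    utterance: str          # what the user says
    expected_tool: str      # tool the agent should call
    expected_args: dict     # arguments it should pass
    max_latency_ms: float   # time budget for the turn


@dataclass
class Observed:
    tool: str
    args: dict
    latency_ms: float


def score(case: EvalCase, observed: Observed) -> dict:
    """Check one turn on behavior and timing, not only on the text produced."""
    return {
        "tool_ok": observed.tool == case.expected_tool,
        "args_ok": observed.args == case.expected_args,
        "latency_ok": observed.latency_ms <= case.max_latency_ms,
    }


def pass_rate(results: list[dict]) -> float:
    """Fraction of turns where every check passed."""
    return sum(all(r.values()) for r in results) / len(results) if results else 0.0
```

A turn that calls the right tool but exceeds its time budget counts as a failure here, which reflects how a user would experience it in a live conversation.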
Voice AI architecture is complex. You don’t have to figure it out alone.
At WebRTC.ventures, we have spent years building production real-time AI systems: Voice AI architecture, WebRTC media paths, orchestration layer design, observability, compliance, and long-term maintenance. We know where these projects stall, which decisions have long-term consequences, and how to move from a working prototype to a system that holds up in production.
If you are evaluating a real-time AI avatar system or working through the decisions in this post, let’s talk.
Further Reading:
- Don’t Mistake the AI Avatar for the Voice AI System Behind It
- Context Engineering Best Practices for Voice AI Agents
- Voicebot Platforms and Strategy for Non-Tech Teams
- Production Voice AI Architecture for Regulated Industries
- Bedrock vs Vertex vs LiveKit vs Pipecat: Choosing a Voice AI Agent Production Framework
- QA Testing for AI Voice Agents: A Real-Time Communication QA Framework