For organizations that prioritize data privacy and want zero variable cloud costs for inference, it is entirely possible to build a voice agent using off-the-shelf open source tools. In this post, we outline a practical Voice AI stack that avoids vendor lock-in while still supporting real-time, natural conversations over WebRTC.

At WebRTC.ventures, we have seen clients ask for self-hosted and on-premise Voice AI architectures that they can control, audit, and tune for their own performance and compliance requirements. The open source stack below is a concrete example of how to achieve this using components you can run on your own infrastructure.

The Open Source Voice AI Stack

This architecture combines an open source orchestrator, a local LLM host, and on-premise STT and TTS services into a cohesive real-time voice agent.

Orchestration: Pipecat

Pipecat by Daily is a Python framework designed specifically for voice agents. It handles the difficult work of frame management, streaming media, and pipeline coordination between your ASR, LLM, and TTS services, so you can focus on conversation design and business logic instead of low-level plumbing.
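
To make the pipeline shape concrete, here is a minimal sketch of a Pipecat pipeline. The Pipeline, PipelineRunner, and PipelineTask imports follow Pipecat's standard examples, but service classes and constructor details vary between releases, so treat the specifics as assumptions to verify against the version you install; the stable idea is the frame flow from transport input through STT, LLM, and TTS back to transport output.

```python
# Minimal Pipecat pipeline sketch: audio in -> STT -> LLM -> TTS -> audio out.
# The service objects (stt, llm, tts) would be configured to point at your
# local Speaches, Ollama, and Kokoro deployments; their class names vary
# across Pipecat releases, so verify them against the version you install.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

async def run_agent(transport, stt, llm, tts):
    pipeline = Pipeline([
        transport.input(),   # user audio frames arriving over WebRTC
        stt,                 # audio -> text (Speaches / faster-whisper)
        llm,                 # text -> streamed response tokens (Ollama / Llama 3.2)
        tts,                 # response text -> audio chunks (Kokoro)
        transport.output(),  # synthesized audio back to the browser
    ])
    runner = PipelineRunner()
    await runner.run(PipelineTask(pipeline))
```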

LLM Host: Ollama with Llama 3.2

Ollama runs the Llama 3.2 model locally on your own hardware. This model is small enough to run fast on consumer or edge hardware but smart enough for many conversational tasks, especially when combined with a domain-specific prompt and tools.
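
As a quick smoke test of the local model (after an `ollama pull llama3.2`), you can call Ollama's documented chat endpoint directly. The snippet below assumes Ollama's default port of 11434:

```python
import requests

# Ask the locally hosted Llama 3.2 model a question via Ollama's /api/chat
# endpoint. Assumes Ollama is running on its default port (11434) and the
# model has already been pulled with `ollama pull llama3.2`.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [
            {"role": "system", "content": "You are a concise voice assistant."},
            {"role": "user", "content": "What are your support hours?"},
        ],
        "stream": False,  # set True in production to stream tokens as they arrive
    },
    timeout=60,
)
print(resp.json()["message"]["content"])
```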

ASR: Speaches (faster-whisper)

Speaches is an OpenAI-API-compatible server that serves multiple ASR models through the faster-whisper runtime. It runs in a Docker container and offers significant speed improvements over the reference Whisper implementation, which is crucial for keeping end-to-end latency low in real-time Voice AI applications.
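
Because Speaches exposes the OpenAI API surface, you can exercise it with the standard OpenAI Python client by overriding the base URL. The port and model identifier below are assumptions; check your container's configuration and its /v1/models listing for the actual values:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Speaches container.
# The port (8000) and model ID are assumptions; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("caller_audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="Systran/faster-whisper-small",  # example faster-whisper model ID
        file=audio_file,
    )
print(transcript.text)
```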

TTS: Kokoro

Kokoro is a high-quality open-weight model (82M params) that provides near-human synthesis at a fraction of the compute cost of larger models. Because it is open weight, you can deploy it on-premise through Speaches and tune the system for your own latency, voice quality, and cost requirements without being tied to a single SaaS provider.
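
Synthesis goes through the same OpenAI-compatible surface, via the /v1/audio/speech endpoint. The model ID and voice name below are placeholders, assuming a Speaches deployment that serves Kokoro; check the model and voice lists your server actually exposes:

```python
from openai import OpenAI

# Synthesize speech with a locally served Kokoro model. The port, model ID,
# and voice name are assumptions; verify them against your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with client.audio.speech.with_streaming_response.create(
    model="kokoro",      # hypothetical model ID for the Kokoro deployment
    voice="af_heart",    # hypothetical voice name
    input="Thanks for calling! How can I help you today?",
    response_format="wav",
) as response:
    response.stream_to_file("reply.wav")
```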

How the Real-Time Voice Flow Works

From the user’s point of view, the interaction feels like a natural conversation with an AI agent in their browser. Under the hood, the flow looks like this (a simplified glue-code sketch follows the list):

  1. Connection (WebRTC): The user connects via a browser using WebRTC, giving you low-latency, bidirectional audio transport with built-in echo cancellation and network adaptation.
  2. Routing (Pipecat to Speaches): Pipecat receives the incoming audio stream and routes it to the local Speaches container for transcription.
  3. Inference (Ollama / Llama 3.2): The transcribed text is sent to Ollama, which runs Llama 3.2 locally; tokens are streamed back to the orchestrator as they are generated, so the response can start as soon as possible.
  4. Synthesis (Kokoro TTS): Pipecat buffers the incoming text stream and sends it to Kokoro, which synthesizes audio chunks in real time.
  5. Playback (WebRTC return path): The audio is streamed back to the user via WebRTC, allowing for immediate playback and natural back-and-forth dialog.
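
Stripping away the WebRTC transport, steps 2 through 4 reduce to roughly the glue code below. This is a simplified, non-streaming sketch for illustration; the ports, model IDs, and voice names are the same assumptions as above, and a production agent streams each stage instead of waiting for complete results:

```python
import requests
from openai import OpenAI

# Local endpoints (assumed ports): Speaches for STT/TTS, Ollama for the LLM.
speech = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

def handle_turn(audio_path: str) -> str:
    # 1. Transcribe the user's audio (Speaches / faster-whisper).
    with open(audio_path, "rb") as f:
        text = speech.audio.transcriptions.create(
            model="Systran/faster-whisper-small",  # example model ID
            file=f,
        ).text

    # 2. Generate a reply with the local Llama 3.2 model (Ollama).
    reply = requests.post(
        OLLAMA_CHAT_URL,
        json={
            "model": "llama3.2",
            "messages": [{"role": "user", "content": text}],
            "stream": False,
        },
        timeout=60,
    ).json()["message"]["content"]

    # 3. Synthesize the reply with Kokoro and write it to disk.
    with speech.audio.speech.with_streaming_response.create(
        model="kokoro",      # hypothetical model ID
        voice="af_heart",    # hypothetical voice name
        input=reply,
        response_format="wav",
    ) as response:
        response.stream_to_file("reply.wav")
    return "reply.wav"
```

In the real pipeline, Pipecat replaces this sequential loop with streaming frames between the stages, which is what keeps perceived latency low enough for natural turn-taking.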

You can find a working reference implementation of this architecture in this GitHub repository.

The “Aha” Moment: Zero Per-Token Cloud Costs

This architecture delivers a fully functional conversational agent with zero per-token cloud costs. You own the data, the model weights, and the infrastructure, which reduces your exposure to changing SaaS pricing, rate limits, and sudden API deprecations.

Of course, performance will depend on your server specs and model selection. On consumer devices such as laptops, smaller models like llama3.2:1b will often be required to keep latency and resource usage within an acceptable range for real-time voice interactions.
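
One practical way to compare candidates (say, llama3.2:1b against the larger default tag) on your own hardware is to measure time to first token, which dominates how responsive the agent feels. The probe below uses Ollama's streaming chat API on its default port; note that the first call to each model also includes model load time:

```python
import json
import time
import requests

def time_to_first_token(model: str, prompt: str = "Hello!") -> float:
    """Seconds until the first streamed token arrives from Ollama."""
    start = time.monotonic()
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,
        timeout=120,
    )
    # Ollama streams newline-delimited JSON chunks; stop at the first
    # chunk that carries assistant content.
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if chunk.get("message", {}).get("content"):
            return time.monotonic() - start
    return float("inf")

for model in ("llama3.2:1b", "llama3.2"):
    print(f"{model}: {time_to_first_token(model):.2f}s to first token")
```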

When to Choose Open Source Voice AI vs. Managed Voice AI Services

A self-hosted, open source Voice AI stack is ideal when you:

  • Need strict data privacy and control over logs, audio, and prompts (for example, healthcare or finance).
  • Expect predictable or high call volumes where per-minute pricing can quickly exceed the cost of maintaining your own infrastructure.

Managed Voice AI services can still be a great fit when you:

  • Need to go to market very quickly with minimal infrastructure work.
  • Prefer to offload scaling, observability, and uptime to a third party.
  • Are still experimenting with your use case and want to iterate on prompts and flows before committing to an on-premise deployment.

Many WebRTC.ventures clients end up testing their Voice AI use case with managed platforms and then migrating proven workloads to open source or hybrid architectures for cost and control.

Build Your Open Source Voice AI Agent with WebRTC.ventures

Building a reliable, low-latency, open source Voice AI agent takes more than wiring together Pipecat, Ollama, Speaches, and Kokoro. You still need to think carefully about WebRTC signaling, media routing, observability, security, and integration with your existing applications and back-end systems.

Our team at WebRTC.ventures has deep experience building production-ready Voice AI systems with both open source and managed stacks, including on-premise deployments using Llama, Ollama, Pipecat, and custom WebRTC media pipelines.

If you are exploring self-hosted or hybrid Voice AI architectures and want to avoid vendor lock-in while still delivering natural, real-time conversations, we can help you design, implement, and optimize the right solution for your use case. Contact WebRTC.ventures today to talk about your Voice AI roadmap.
