One of the biggest challenges in building real-time AI voice agents is the delay between when a user finishes speaking and when the system responds, known as latency. Even small delays in a Voice AI application can disrupt the natural flow of conversation and harm your user experience.
To consistently achieve the sub-second responsiveness needed for human-like conversational flow without sacrificing answer quality, the webrtc-agent-livekit project takes a holistic approach. It combines the real-time capabilities of the WebRTC-based LiveKit Agents framework, low-latency transcription models, and a parallel execution strategy: a fast-response Small Language Model (SLM) delivers a quick initial reply, while a more advanced Large Language Model (LLM) generates a detailed follow-up.
This blog post explores the implementation of a parallel LLM architecture using the LiveKit WebRTC framework. We’ll walk through:
- How the system works using LiveKit Agents and low-latency transcription tools
- The logic behind parallel model execution
- Implementation details including asynchronous Python code
- Configuration tips for optimizing your SLM
Observability insights using Grafana will be the subject of a follow-up post.
Why Latency Matters for Voice Agents
Delays between a user’s question and an AI agent’s response can break conversational flow, make an interaction feel unnatural, and cause users to disengage. A recent study published in the International Journal of Human-Computer Interaction, “The Influence of Response Time and Feedback Type on the User Experience of Voice Interaction among Older Adults,” found that delays over 2 seconds noticeably reduce trust and usability.
This means that for truly conversational experiences, voice agents should respond in under one second. This level of responsiveness helps keep interactions fluid and human-like.
But hitting that target isn’t easy with so many factors introducing latency.
The Architecture Behind SLM + LLM Execution
This implementation is built on the LiveKit Agents framework, which provides a powerful foundation for building real-time Voice AI agents. LiveKit handles the WebRTC communication, allowing the AI avatar to participate in audio conversations with low latency and high reliability. In this architecture, the LiveKit media server runs locally.
The LiveKit Agents framework includes:
- Audio streaming and processing capabilities
- Speech-to-text (STT) integration with various providers
- LLM integrations with open-source and proprietary providers such as Ollama or OpenAI
- Text-to-speech (TTS) capabilities for voice output
- Voice activity detection (VAD) and a small turn detection model for determining when users finish speaking
- Session management for handling conversations
The core of our implementation is in the agent-worker directory, specifically in the fast-preresponse.py and fast-preresponse-ollama.py files. These two example files contain the logic for our parallel LLM approach, acting as a bridge between the user’s audio input and the language models.
The agent-worker processes incoming audio, detects when the user has finished speaking (using LiveKit’s turn detector plugin), transcribes the speech (using a third-party platform, Deepgram), generates responses using the language models (using OpenAI), and converts those responses back to speech (also using OpenAI). This entire pipeline is optimized for minimal latency at each step.
Keep in mind that using the turn detection model adds approximately 100-200 ms of extra latency.
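For orientation, the sketch below shows how such a pipeline is typically wired with LiveKit Agents. The plugin choices mirror the ones mentioned above, but the model names and options are illustrative assumptions rather than the project’s exact configuration:

```python
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

# Illustrative wiring only: model names and options are assumptions,
# not the project's exact settings.
session = AgentSession(
    stt=deepgram.STT(model="nova-3"),     # low-latency streaming transcription
    llm=openai.LLM(model="gpt-4o-mini"),  # main (large) language model
    tts=openai.TTS(),                     # voice output
    vad=silero.VAD.load(),                # voice activity detection
    turn_detection=MultilingualModel(),   # end-of-turn detection model
)
```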
Parallel Model Execution
The strategy here is the parallel execution of two language models:
- Small Language Model (SLM): A lightweight, fast model that can generate initial responses quickly
- Large Language Model (LLM): A more powerful model that generates higher-quality, more comprehensive responses
Instead of waiting for the larger model to complete its response, we immediately begin generating a response with the smaller model. This allows us to start speaking much sooner, reducing the perceived latency for the user.
Streaming Response Generation
Both the small and large language models use a streaming architecture, where tokens are generated and processed incrementally rather than waiting for the complete response. This further reduces latency by allowing the system to begin speaking as soon as the first few tokens are available.
The code snippet below from the implementation shows how the streaming is handled:
```python
# Let the main LLM reply be aware of this "silence filler" response
fast_llm_fut = asyncio.Future[str]()

async def _fast_llm_reply() -> AsyncIterable[str]:
    filler_response: str = ""
    # Stream tokens from the fast model and forward them as they arrive
    async for chunk in self._fast_llm.chat(
        chat_ctx=fast_llm_ctx
    ).to_str_iterable():
        filler_response += chunk
        yield chunk
    fast_llm_fut.set_result(filler_response)

# Speak the filler right away; we don't add this quick filler to the chat context
self.session.say(
    _fast_llm_reply(),
    add_to_chat_ctx=False,
)

filler_response = await fast_llm_fut
logger.info(f"Fast response: {filler_response}")
turn_ctx.add_message(
    role="assistant",
    content=filler_response,
    interrupted=False,
)
```
This asynchronous approach lets the SLM generate a quick response and pass it to the TTS system immediately, while the complete answer is still produced through the standard LLM reply path.
Configuring the SLM for Quick Responses
The SLM is configured to provide quick, concise responses. Since its primary purpose is to reduce initial response latency, we optimize for speed rather than comprehensive answers.
Key configuration aspects (see the sketch after this list):
- A smaller context window to reduce processing time (we even truncate user input)
- A lower temperature for more deterministic responses
- Potentially, a specialized prompt that encourages brief initial responses
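As a rough illustration, here is a minimal sketch of what that configuration might look like, loosely following the structure of fast-preresponse.py. The model name, temperature, prompt wording, and truncation length are assumptions for illustration, not the project’s exact values:

```python
from livekit.agents import Agent, llm
from livekit.plugins import openai


class FastPreResponseAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice assistant.")
        # Assumed values: a small, fast model with a low temperature
        self._fast_llm = openai.LLM(model="gpt-4.1-nano", temperature=0.3)
        # A prompt that pushes the fast model toward short acknowledgements
        self._fast_llm_prompt = llm.ChatMessage(
            role="system",
            content=["Reply with a very short acknowledgement, 5 to 10 words."],
        )

    async def on_user_turn_completed(
        self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
    ) -> None:
        # Keep the fast model's context small: drop instructions and tool calls
        # and keep only the last few items of the conversation.
        fast_llm_ctx = turn_ctx.copy(
            exclude_instructions=True, exclude_function_call=True
        ).truncate(max_items=3)
        fast_llm_ctx.items.insert(0, self._fast_llm_prompt)
        fast_llm_ctx.items.append(new_message)
        # ... the streaming filler reply shown earlier would use fast_llm_ctx here
```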
Managing the Parallel Execution of Models
The parallel execution is managed through Python’s asynchronous programming features. The implementation uses the asyncio library to run both models concurrently without blocking the main thread.
The process works as follows (a stripped-down sketch follows the list):
- When the user finishes speaking, both models are initiated simultaneously
- As soon as the SLM begins generating tokens, they are sent to the TTS system
- The user hears the SLM’s response while the LLM continues generating its more comprehensive answer; the SLM response is not added to the chat context (add_to_chat_ctx=False)
- The final LLM response is added to the conversation context for continuity
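Outside of LiveKit, the same pattern can be reduced to a toy asyncio example. Everything below is a placeholder: the function names, sleep durations, and replies simply stand in for the real SLM, LLM, and TTS calls:

```python
import asyncio


async def fast_reply(text: str) -> str:
    # Stand-in for the SLM: returns a short acknowledgement quickly.
    await asyncio.sleep(0.2)
    return "Sure, let me look into that."


async def full_reply(text: str) -> str:
    # Stand-in for the LLM: returns the detailed answer more slowly.
    await asyncio.sleep(1.5)
    return "Here is the detailed answer..."


async def handle_turn(user_text: str) -> None:
    # Start both models at the same time.
    fast_task = asyncio.create_task(fast_reply(user_text))
    full_task = asyncio.create_task(full_reply(user_text))

    # Speak the quick reply as soon as it is ready...
    print("TTS:", await fast_task)
    # ...then follow up with the comprehensive answer.
    print("TTS:", await full_task)


asyncio.run(handle_turn("What's the weather like tomorrow?"))
```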
A Practical Example of a Basic Agentic System
This architecture goes beyond just reducing latency. It’s a practical example of a basic agentic system. By combining fast transcription, the LiveKit Agent framework, and parallel language models, it enables voice agents to respond quickly while still producing high-quality answers.
Key takeaways:
- Fast response time is critical for natural conversations. Sub-second replies help maintain flow and make the agent feel more human.
- Parallel execution of SLM and LLM allows the system to deliver an immediate short reply, then follow up with a more thoughtful, accurate one.
- Agent-like behavior emerges from how the system listens, responds, and updates the conversation context asynchronously. It reacts in real time, without blocking or waiting on long computations, allowing for fully customizable and complex flows.
- There’s room for optimization: the current architecture relies on multiple third-party APIs, some of which (like OpenAI TTS) introduce cloud latency. Co-locating services or running them on edge infrastructure could cut that down significantly.
As language models continue to improve, this architecture offers a strong foundation for building more advanced, real-time voice agents that are both responsive and intelligent—moving us closer to voice interfaces that feel less like tools and more like collaborators.
Demo time!
Toward More Responsive, Real-Time Voice Agents
Reducing latency in AI-powered voice agents isn’t just a technical optimization. It is essential to creating natural, human-like interactions. By combining fast transcription, WebRTC streaming via LiveKit, and parallel execution of Small and Large Language Models, you can significantly improve responsiveness without sacrificing the quality of your AI’s replies.
This architecture not only solves a key UX challenge but also lays the groundwork for more advanced, agentic behavior in real-time applications. With tools like asynchronous streaming, real-time audio pipelines, and intelligent turn detection, we’re moving closer to voice interfaces that feel truly collaborative.
If you’re building a Voice AI solution and want to explore how technologies like LiveKit and WebRTC can help you deliver faster, smarter experiences—our expert team at WebRTC.ventures can help. We’re specialists in real-time communications and agent architectures, and we’re proud to be a trusted LiveKit development partner.
Contact us at WebRTC.ventures to bring your real-time voice agent to life!
Further Reading: