Prompt engineering gets you a demo. Context engineering gets you a production Voice AI agent.

Think of LLMs as the world’s most brilliant librarians: they’ve read almost everything ever written, but without your help, they have the short-term memory of a goldfish. For text-based chatbots, a forgetful LLM is annoying. For a real-time voice agent, it kills the experience entirely. This is because the goal for Voice AI is human-level fluidity with sub-second latency. To get there, prompt engineering isn’t enough anymore. We need context engineering.

In this post, we dive into LLM context management best practices for Voice AI agents and how to apply them effectively in production.

What Is Context in a Voice AI Agent?

In a conversation, context is the invisible thread that connects what was said five minutes ago to what is being said now. For an LLM, context is composed of a multi-layered stack of information:

  • Static Context: The system prompt that defines the agent’s persona and the rules of engagement.
  • Dynamic Context: RAG (Retrieval-Augmented Generation) from knowledge bases, real-time data from MCP (Model Context Protocol) servers, and available tools.
  • Live Context: The live transcript, the sentiment of the user’s voice, and the results of any tools the agent has used.

Think of the LLM as a reasoning engine, not a database. It doesn’t need to “store” the data forever; it just needs the right data at the exact moment it needs to make a decision.
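
To make this concrete, here is a minimal, framework-agnostic sketch of how the three layers typically come together into the message list sent to the LLM on each turn. The function and its arguments are illustrative placeholders, not part of any particular framework:

def build_turn_context(system_prompt, retrieved_docs, recent_transcript, tool_results):
    """Assemble static, dynamic, and live context for a single LLM call (illustrative)."""
    messages = [
        # Static context: persona and rules of engagement
        {"role": "system", "content": system_prompt},
        # Dynamic context: RAG snippets and real-time data fetched for this turn
        {"role": "system", "content": "Relevant knowledge:\n" + "\n".join(retrieved_docs)},
    ]
    # Live context: the running transcript plus any tool results from earlier turns
    messages += recent_transcript
    messages += tool_results
    return messages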

Why a Full LLM Context Window Hurts Voice AI Performance

It is tempting to just shove everything into the context window. After all, modern models can handle anywhere from 128k to 2M tokens. Why not just give the model everything?

There are three reasons why a full LLM context window fails:

  1. Instruction Drift: LLMs suffer from “Lost in the Middle.” When the context gets too crowded, the model starts to prioritize the recent conversation history over your initial system instructions. Suddenly, your professional support agent is acting like it’s in the Wild West.
  2. The Voice-Specific Penalty: In real-time communication, latency is the only metric that truly matters from the user’s perspective. More tokens in the context window mean a higher Time to First Token (TTFT). If your agent takes three seconds to “think” because it’s reading a 50-page transcript, the user will assume the call has dropped. (A quick way to measure per-turn prompt size is sketched right after this list.)
  3. Cost: Every turn costs money. Processing a massive context window for a simple “Hello” is like using a sledgehammer to crack a nut.
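
As a rough, illustrative way to keep an eye on points 2 and 3, count the prompt tokens you send on every turn, for example with the tiktoken library (the encoding name depends on your model, and this ignores per-message framing overhead):

import tiktoken

# cl100k_base is a common encoding; pick the one that matches your model
enc = tiktoken.get_encoding("cl100k_base")

def count_prompt_tokens(messages) -> int:
    """Rough per-turn prompt size in tokens."""
    return sum(len(enc.encode(m["content"])) for m in messages)

# Log this before each LLM call: if it keeps climbing, so will TTFT and cost.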

LLM Context Management Strategies for Voice AI Agents

To keep your Voice AI sharp, you need a strategy for what the LLM should keep and what it should forget. The examples throughout this post use Pipecat, an open-source framework for building real-time voice AI agents, but the strategies apply regardless of your stack.

Quick Tip: Decouple Call State from Conversation History

One quick tip (if applicable to your use case) is to store any data relevant to the interaction in a JSON “State” object kept separate from the conversation history. This object contains essential information such as contact details, account IDs, past transactions, and anything else that can be retrieved from existing knowledge. The state is injected into the context on every turn and can persist for the life of the call, which lets you be much more aggressive about pruning the actual transcript.

By keeping the essentials as a standalone piece of information, you reduce conversation turns, since the relevant information is already known, while retaining the freedom to trim and adjust the remaining conversation context as required.

For instance, if using Pipecat, you can inject the state into system_instruction, which is prepended to every LLM request automatically. If the state changes, an LLMUpdateSettingsFrame refreshes it mid-conversation.

import json
from dataclasses import asdict, dataclass

# Pipecat imports (exact module paths may vary slightly between versions)
from pipecat.frames.frames import LLMUpdateSettingsFrame
from pipecat.pipeline.task import PipelineTask
from pipecat.services.openai.llm import OpenAILLMService

@dataclass
class CallState:
    user_id: str | None = None
    account_tier: str | None = None
    verified: bool = False

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

def build_system_instruction(state: CallState) -> str:
    return (
        "You are a helpful customer-support voice agent.\n\n"
        "## Current Call State\n"
        f"```json\n{state.to_json()}\n```\n\n"
        "Use the call state to personalize your responses. "
        "Do not ask for information you already have."
    )

# Push the refreshed system instruction to the LLM mid-call.
async def sync_state_to_llm(state: CallState, task: PipelineTask) -> None:
    await task.queue_frame(
        LLMUpdateSettingsFrame(
            delta=OpenAILLMService.Settings(
                system_instruction=build_system_instruction(state),
            )
        )
    )

# State is populated from event handlers
@transport.event_handler("on_client_connected")
async def on_connected(transport, client):
    state.user_id = "usr_12345"
    state.account_tier = "premium"
    await sync_state_to_llm(state, task)

With the crucial information decoupled from the live context, let’s take a look at some strategies for Voice AI agent context management.

Strategy 1: The Sliding Window (Recency Bias)

The sliding window strategy keeps only the last N exchanges in the context. It keeps latency low and predictable. It’s great for simple tasks, but be careful: if the call goes long, the agent might forget why it was talking to the user in the first place.

The snippet below hooks into Pipecat’s on_assistant_turn_stopped event. After each turn, it checks the context size and, if the history has grown too long, replaces it with the system prompt plus only the most recent messages.

MAX_TURNS = 10
MAX_MESSAGES = MAX_TURNS * 2  # each turn is one user + one assistant message

@assistant_aggregator.event_handler("on_assistant_turn_stopped")
async def _on_assistant_turn_stopped(aggregator, message):
    messages = context.messages

    # The +1 accounts for the system prompt kept at index 0
    if len(messages) <= MAX_MESSAGES + 1:
        return

    system_prompt = []
    if messages and messages[0].get("role") == "system":
        system_prompt = [messages[0]]

    trimmed_history = messages[-MAX_MESSAGES:]

    # Don't let the trimmed window start with an orphaned tool result
    if trimmed_history and trimmed_history[0].get("role") == "tool":
        trimmed_history = trimmed_history[1:]

    new_messages = system_prompt + trimmed_history
    await task.queue_frames([LLMMessagesUpdateFrame(new_messages)])

Strategy 2: Auto-Summarization

Every few turns, have a background process summarize the conversation so far. You then replace the raw transcript with this “Executive Summary.” This preserves the intent of the conversation without the token bloat. It’s the difference between reading a book and reading the CliffsNotes.

Pipecat has this built in. Enable enable_auto_context_summarization=True on the assistant aggregator and Pipecat automatically compresses older messages while preserving recent ones.
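
A minimal sketch of wiring this up, assuming your pipeline already defines an llm service and a system_prompt, and that the flag is exposed through the assistant aggregator’s params (exact module paths and the field’s location may differ between Pipecat versions):

from pipecat.processors.aggregators.llm_response import LLMAssistantAggregatorParams
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(messages=[{"role": "system", "content": system_prompt}])

# Older messages get compressed into a summary; the most recent turns stay verbatim.
context_aggregator = llm.create_context_aggregator(
    context,
    assistant_params=LLMAssistantAggregatorParams(
        enable_auto_context_summarization=True,
    ),
)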

Strategy 3: Context Resets & Milestones

Identify logical “checkpoints” in your workflow. Once a user has successfully passed identity verification, you don’t need the three minutes of back-and-forth about their mother’s maiden name anymore. Flush that buffer and move to the next stage with a “Verification Successful” flag.

In Pipecat, it is possible to use a single agent that flushes its own context at each milestone. When verification completes, LLMMessagesUpdateFrame([]) wipes the transcript and LLMUpdateSettingsFrame swaps the system instruction for the next stage.

from pipecat.services.llm_service import FunctionCallParams  # path may differ across Pipecat versions

async def flush_and_transition(state, new_instruction, task):
    """Wipe the conversation context and update the system instruction."""
    await task.queue_frames([LLMMessagesUpdateFrame([])])
    await task.queue_frame(
        LLMUpdateSettingsFrame(
            delta=OpenAILLMService.Settings(system_instruction=new_instruction)
        )
    )

# Tool handler invoked when the LLM calls the identity-verification function
async def handle_verify(params: FunctionCallParams):
    name = params.arguments.get("name", "Unknown")
    state.caller_name = name
    state.stage = "support"
    await params.result_callback(f"Identity verified for {name}.")
    await flush_and_transition(
        state,
        new_instruction=f"You are a support agent. The caller ({name}) has been verified.",
        task=task,
    )

However, if you are able to identify these milestones, you might also want to consider moving from a single agent to multiple agents, as shown in the following strategy.

Strategy 4: Agentic Workflows

Sometimes the best way to manage context is to leave it behind. Instead of having “God-Agents” that (attempt to) know everything, use specialized agents. If a user moves from a billing question to a technical support issue, hand the call off to a new agent. Much like a worker in a specialized field, each agent starts with a clean, relevant-only context.

Pipecat Flows models this as a graph. Each node is a specialist with its own role, tools, and ContextStrategy.RESET. When the user asks for billing, the flow transitions to the billing node and the router’s transcript is discarded. Task messages use the "developer" role for node-specific instructions.

from pipecat_flows import NodeConfig, ContextStrategy, ContextStrategyConfig

def create_billing_node() -> NodeConfig:
    return NodeConfig(
        name="billing",
        role_message="You are a billing specialist at Acme Corp.",
        task_messages=[
            {"role": "developer", "content": "Ask the caller about their billing question."}
        ],
        functions=[back_to_menu_func],  # defined elsewhere in the flow
        # RESET discards the previous node's transcript when entering this node
        context_strategy=ContextStrategyConfig(strategy=ContextStrategy.RESET),
    )

Build Smarter Voice AI Agents by Engineering Context, Not Just Prompts

Context is the primary bottleneck for agentic performance. In the early days of AI, we focused on how to talk to models. Now, the challenge is how to help them remember efficiently. The most successful Voice AI applications won’t be the ones with the longest memory; they will be the ones that know exactly what is important and what is just noise.

After all, if you want your agent to stay on mission and avoid “glitching in the matrix,” you have to be the architect of its reality.

If you’re looking to optimize your Voice AI agent for sub-second latency and efficient context management, look no further than the Voice AI experts at WebRTC.ventures. Leverage our experience to boost your agent’s abilities. Contact us today!
