Video AI Agent: Vonage Video Connector SDK + Pipecat Transport

Vonage recently released the Video Connector SDK and a companion Pipecat Transport. These new tools let a server-side AI agent join a Vonage Video session as a full participant, with real-time audio and video. Together they make it possible to build an AI Avatar that doesn’t just chat, but actually listens and speaks inside a live video call.

That shift matters because most AI interaction today is still text-based — a chatbot as the first point of contact in a customer service flow. Video changes the shape of that interaction entirely. It’s like the jump from radio to television, except this time the TV talks back.

The use cases are already taking shape: live commerce co-hosts that demo products and answer questions in real time, virtual tutors that adapt to a student’s facial expressions, customer support avatars that feel more human than a hold queue, and telehealth assistants that can triage before a doctor joins.

In this post, you’ll use the Vonage Video Connector SDK and Pipecat Transport to build a Video AI agent that joins a Vonage Video session with:

Real-time speech-to-text using Deepgram
LLM-powered reasoning through Amazon Bedrock
Natural text-to-speech from ElevenLabs
A lip-synced talking avatar from Simli

All of it orchestrated with Pipecat, an open-source framework by Daily for building voice and multimodal AI agents. By the end, you’ll have a working AI participant that joins a video call, listens, thinks, and responds with a realistic avatar.

Prerequisites

Before you start, make sure you have the following:

Python 3.13+ — the Vonage Video Connector SDK native library requires Python 3.13 on Linux AMD64 or ARM64
A Vonage Video API account — sign up here and create a Video API application in the dashboard
AWS credentials with access to Amazon Bedrock (Claude Haiku model enabled in your region)
A Deepgram API key — sign up for free for speech-to-text
An ElevenLabs API key and voice ID — create an account for text-to-speech
A Simli API key and face ID — sign up at Simli for the lip-synced avatar

Install the project dependencies:

pip install "pipecat_ai[deepgram,elevenlabs,aws,simli,vonage-video-connector] @ https://github.com/Vonage/pipecat/releases/download/v1.3.0.post1/pipecat_ai-1.3.0.post1-py3-none-any.whl"
pip install vonage-video aiohttp pydantic python-dotenv boto3

Create a .env file with your credentials:

# Vonage Video API
VONAGE_APPLICATION_ID=your_app_id
VONAGE_PRIVATE_KEY_PATH=private.key

# AWS (for Bedrock)
AWS_REGION=us-east-1

# Deepgram STT
DEEPGRAM_API_KEY=your_deepgram_key

# ElevenLabs TTS
ELEVENLABS_API_KEY=your_elevenlabs_key
ELEVENLABS_VOICE_ID=your_voice_id

# Simli Avatar
SIMLI_API_KEY=your_simli_key
SIMLI_FACE_ID=your_face_id
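A missing credential tends to surface late, as a failure deep inside one of the services. As a minimal sketch, the check below reports which of the variables listed above are unset before the agent starts. The helper names `missing_vars` and `check_env` are hypothetical and not part of Pipecat or the Vonage SDK, and `AWS_REGION` is treated as optional because the agent code falls back to a default for it.

```python
import os

# Names taken from the .env file above. AWS_REGION is left out because
# the agent code supplies a default value for it.
REQUIRED_VARS = [
    "VONAGE_APPLICATION_ID",
    "VONAGE_PRIVATE_KEY_PATH",
    "DEEPGRAM_API_KEY",
    "ELEVENLABS_API_KEY",
    "ELEVENLABS_VOICE_ID",
    "SIMLI_API_KEY",
    "SIMLI_FACE_ID",
]


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required names that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


def check_env(env=None):
    """Raise RuntimeError listing every missing variable, if any."""
    env = os.environ if env is None else env
    missing = missing_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

If you load the file with python-dotenv, calling `check_env()` right after `load_dotenv()` would be a natural place for it.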

Architecture: How a Video AI Agent Pipeline Works End-to-End

Here’s the high-level data flow:

Browser (Participant) ↔ Vonage Video SFU ↔ Pipecat Pipeline (AI Agent)

The participant’s browser connects to a Vonage Video session via WebRTC. The AI agent also connects to that same session using the Vonage Video Transport for Pipecat. Audio from the participant flows into a Pipecat pipeline, gets transcribed, processed by an LLM, converted to speech, rendered as avatar video, and sent back into the session.

Figure: The flow for a Video AI agent in a Vonage Video API video session.

This is what makes it a real-time voice AI pipeline. Each stage in the pipeline processes data as streaming frames. Audio comes in continuously, gets transcribed word by word, the LLM streams its response token by token, TTS generates audio in chunks, and the avatar renders video frame by frame. This streaming approach keeps latency low and makes the conversation feel natural.

Here’s the pipeline definition in code:

from pipecat.pipeline.pipeline import Pipeline

pipeline = Pipeline(
    [
        transport.input(),           # Audio from Vonage session participants
        stt,                         # Deepgram speech-to-text
        context_aggregator.user(),   # Accumulate user context
        llm,                         # Bedrock Claude generates response
        tts,                         # ElevenLabs text-to-speech
        simli,                       # Simli generates lip-synced avatar video
        transport.output(),          # Send avatar audio + video to Vonage session
        context_aggregator.assistant(),
    ]
)

Pipecat handles the streaming, buffering, interruption detection, and synchronization between services.
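The streaming idea can be illustrated without Pipecat at all. The toy sketch below chains plain Python async generators, so each stage passes a result downstream as soon as it has one. The stage names (`microphone`, `transcribe`, `respond`) are hypothetical stand-ins, and this is not how Pipecat implements its frame processors. It only shows why output for the first chunk can appear before later input has arrived.

```python
import asyncio


async def microphone(log):
    """Toy input: three audio chunks arriving one at a time."""
    for chunk in ["hi", "there", "agent"]:
        log.append(f"in:{chunk}")
        yield chunk
        await asyncio.sleep(0)  # give other tasks a chance to run


async def transcribe(audio_chunks):
    """Toy STT stage: emit one 'word' per incoming chunk."""
    async for chunk in audio_chunks:
        yield chunk.upper()


async def respond(words):
    """Toy LLM stage: emit a reply token as soon as each word arrives."""
    async for word in words:
        yield f"echo:{word}"


async def run_pipeline():
    """Run the chained stages and record the order of inputs and outputs."""
    log = []
    async for token in respond(transcribe(microphone(log))):
        log.append(f"out:{token}")
    return log
```

Running `asyncio.run(run_pipeline())` returns a log in which inputs and outputs alternate, rather than all inputs followed by all outputs.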

The Vonage Video Transport for Pipecat: Connecting WebRTC to Your AI Pipeline

The Vonage Video Transport for Pipecat is the official transport that bridges Vonage Video sessions with Pipecat’s frame-based pipeline. It handles all the complexity of WebRTC connectivity, audio/video format conversion, stream subscription, and session lifecycle — so you can focus on building your AI logic.

Under the hood, the transport uses the Vonage Video Connector SDK, a native library that manages the WebRTC connection. The transport wraps this in Pipecat’s BaseInputTransport and BaseOutputTransport interfaces, handling the thread-safe bridging between native C callbacks and Python’s asyncio event loop automatically.

Setting Up the Transport

Install the transport as a Pipecat extra:

pip install "pipecat-ai[vonage-video-connector]"

Then configure it in your agent:

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.vonage.video_connector import (
    VonageVideoConnectorTransport,
    VonageVideoConnectorTransportParams,
)

transport = VonageVideoConnectorTransport(
    application_id=application_id,
    session_id=session_id,
    token=token,
    params=VonageVideoConnectorTransportParams(
        audio_in_enabled=True,
        audio_in_sample_rate=16000,
        audio_in_channels=1,
        audio_out_enabled=True,
        audio_out_sample_rate=16000,
        audio_out_channels=1,
        video_out_enabled=True,
        video_out_width=512,
        video_out_height=512,
        video_out_color_format="RGB",
        publisher_name="AI Agent",
        audio_in_auto_subscribe=True,
        vad_analyzer=SileroVADAnalyzer(),
        clear_buffers_on_interruption=True,
    ),
)
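To get a feel for the data volumes these parameters imply, the arithmetic can be sketched as follows. It assumes 16-bit (2-byte) PCM samples and an illustrative 20 ms audio chunk, neither of which is stated in the configuration above. The function names are hypothetical.

```python
def pcm_bytes(sample_rate, channels, duration_ms, bytes_per_sample=2):
    """Size in bytes of a PCM audio chunk (16-bit samples assumed by default)."""
    return sample_rate * channels * bytes_per_sample * duration_ms // 1000


def video_frame_bytes(width, height, bytes_per_pixel):
    """Size in bytes of one uncompressed video frame."""
    return width * height * bytes_per_pixel


# 16 kHz mono audio, as configured above
audio_20ms = pcm_bytes(16000, 1, 20)          # 640 bytes per 20 ms chunk
audio_1s = pcm_bytes(16000, 1, 1000)          # 32000 bytes per second

# 512x512 output video
rgb_frame = video_frame_bytes(512, 512, 3)    # 786432 bytes (3 bytes per pixel)
argb_frame = video_frame_bytes(512, 512, 4)   # 1048576 bytes (4 bytes per pixel)
```

Under those assumptions, a single uncompressed video frame is far larger than a second of audio, which is one reason discarding queued frames on interruption matters.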

A few things to note about the configuration:

Automatic stream subscription. With audio_in_auto_subscribe=True, the transport automatically subscribes to audio from any participant that joins the session. No manual callback wiring needed — when someone speaks, their audio flows into your pipeline.
Video format handling. You specify video_out_color_format="RGB" and the transport handles the pixel format conversion internally. Pipecat’s Simli service outputs RGB frames, and Vonage expects ARGB32 — the transport takes care of this without you needing to write any conversion code.
Voice Activity Detection. The SileroVADAnalyzer detects when the participant is speaking vs. silent. This optimizes processing by only sending audio to the STT service when there’s actual speech, and it enables interruption detection.
Buffer clearing on interruption. When clear_buffers_on_interruption=True, if the user starts speaking while the agent is mid-response, any queued audio/video that hasn’t been sent yet is discarded. This makes the agent stop talking immediately rather than finishing its current sentence.
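The buffer-clearing behavior described in the last point can be pictured with a small queue. This is a conceptual sketch only: `OutputBuffer` is a hypothetical class, not the transport's actual implementation, and real frames would be audio or video data rather than strings.

```python
from collections import deque


class OutputBuffer:
    """Hypothetical queue of frames waiting to be sent to the session."""

    def __init__(self, clear_on_interruption=True):
        self.clear_on_interruption = clear_on_interruption
        self.pending = deque()

    def enqueue(self, frame):
        """Add a frame to the end of the send queue."""
        self.pending.append(frame)

    def send_next(self):
        """Pop the oldest pending frame, or return None when empty."""
        return self.pending.popleft() if self.pending else None

    def on_interruption(self):
        """Drop unsent frames if configured; return how many were dropped."""
        if not self.clear_on_interruption:
            return 0
        dropped = len(self.pending)
        self.pending.clear()
        return dropped
```

With clearing enabled, anything still queued when the user starts talking is never sent. With it disabled, the queued frames would still play out after the interruption.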

Session Lifecycle

Unlike lower-level approaches where you manually manage connect/disconnect calls, the official transport integrates its lifecycle with Pipecat’s PipelineRunner. You just call runner.run(task) and the transport handles connecting to the session, publishing, subscribing to streams, and disconnecting when the pipeline ends:

from pipecat.pipeline.runner import PipelineRunner

runner = PipelineRunner()
await runner.run(task)

Event Handlers

The transport exposes session events that let you react to participants joining or leaving:

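import asyncio

from pipecat.frames.frames import TTSSpeakFrame

@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, data):
    # Greet when the first participant connects; `task` is the PipelineTask
    # created for this pipeline and must exist before the event fires.
    await asyncio.sleep(2.0)
    await task.queue_frames([TTSSpeakFrame("Hello! What can I help you with today?")])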

Available events include on_joined, on_left, on_first_participant_joined, on_participant_joined, on_participant_left, on_client_connected, and on_client_disconnected. These give you hooks into the full session lifecycle without needing to manage raw SDK callbacks.
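For readers unfamiliar with the decorator style used above, the pattern can be sketched in a few lines of plain Python. `EventEmitter` here is a hypothetical stand-in, not Pipecat's implementation: it only shows how a decorator can register an async handler under an event name and how an emitter can later await it.

```python
import asyncio
from collections import defaultdict


class EventEmitter:
    """Hypothetical stand-in for an object exposing named async events."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def event_handler(self, name):
        """Decorator that registers an async handler under `name`."""
        def register(func):
            self._handlers[name].append(func)
            return func
        return register

    async def emit(self, name, *args):
        """Await every handler registered for `name`, in registration order."""
        for handler in self._handlers[name]:
            await handler(self, *args)


emitter = EventEmitter()
greeted = []


@emitter.event_handler("on_first_participant_joined")
async def greet(source, data):
    # Record a greeting instead of speaking, to keep the sketch self-contained
    greeted.append(f"hello {data['name']}")
```

Calling `asyncio.run(emitter.emit("on_first_participant_joined", {"name": "Ana"}))` runs the registered handler. Emitting an event with no handlers does nothing.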

Wiring Up AI Services: STT, LLM, TTS, and Avatar Generation

With the transport in place, configuring the AI services is straightforward. Each one plugs into the pipeline as a processor that consumes frames from the previous stage and produces frames for the next:

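import os

from pipecat.services.aws.llm import AWSBedrockLLMService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.elevenlabs.tts import ElevenLabsTTSService
from pipecat.services.simli.video import SimliVideoService

# Speech-to-Text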
stt = DeepgramSTTService(api_key=os.environ["DEEPGRAM_API_KEY"])

# LLM (AWS Bedrock Claude)
llm = AWSBedrockLLMService(
    aws_region=os.environ.get("AWS_REGION", "us-east-1"),
    settings=AWSBedrockLLMService.Settings(
        model=os.environ.get("BEDROCK_MODEL_ID", "us.anthropic.claude-haiku-4-5-20251001-v1:0"),
        max_tokens=100,
    ),
)

# Text-to-Speech
tts = ElevenLabsTTSService(
    api_key=os.environ["ELEVENLABS_API_KEY"],
    settings=ElevenLabsTTSService.Settings(voice=os.environ["ELEVENLABS_VOICE_ID"]),
)

# Lip-synced Avatar
simli = SimliVideoService(
    api_key=os.environ["SIMLI_API_KEY"],
    face_id=os.environ["SIMLI_FACE_ID"],
)

A few things worth noting:

Streaming, not batching. Pipecat doesn’t wait for the LLM to finish its entire response before starting TTS. As soon as the first tokens arrive, they flow into the TTS service. As soon as audio chunks come back, they flow into the avatar rendering service. This pipelining keeps end-to-end latency tight.
Interruption handling. If the user starts speaking while the agent is mid-response, Pipecat detects this and cancels the current generation. The agent stops talking and starts listening again. This is configured with a single parameter:

from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,
    params=PipelineParams(allow_interruptions=True),
)

Context management. The LLMContextAggregatorPair keeps track of the conversation history, accumulating user messages and assistant responses so the LLM has full context for each turn.

# --- Context / Aggregators ---
context = LLMContext(
    messages=[{"role": "system", "content": system_instruction}],
)
context_aggregator = LLMContextAggregatorPair(context)
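What "accumulating" means here can be shown with a simplified model of the history. `ConversationContext` below is a hypothetical class for illustration, not Pipecat's `LLMContext`. It assumes the common role/content message shape used in the snippet above.

```python
class ConversationContext:
    """Hypothetical, simplified conversation history."""

    def __init__(self, system_instruction):
        # The system message always comes first, as in the snippet above
        self.messages = [{"role": "system", "content": system_instruction}]

    def add_user(self, text):
        """Append what the participant said (the transcribed speech)."""
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text):
        """Append what the agent replied."""
        self.messages.append({"role": "assistant", "content": text})


history = ConversationContext("You are a helpful video agent.")
history.add_user("Do you ship to Canada?")
history.add_assistant("Yes, we do.")
history.add_user("How long does it take?")
```

On the second question, the LLM would receive all four messages, which is what lets it resolve "it" to the shipment mentioned earlier.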

Demo Time

The end result is a Video AI agent that joins the call and converses with customers by voice, guided by the system prompt and the voice AI services you chose for the use case. Here’s what the finished AI avatar video call looks like in action:

Extending Your Video AI Agent

The Vonage Video Transport for Pipecat makes it simple to drop an AI participant into any video session. The official transport handles all the complexity of WebRTC connectivity, native thread bridging, audio/video format conversion, and stream subscription while Pipecat manages the streaming, interruption, and service orchestration.

From here, you could extend this in several directions:

Multi-agent sessions. Multiple AI participants with different roles in the same call
Tool use. Give the agent function-calling capabilities to look up data, book appointments, or control external systems
Vision input. Process the participant’s video feed so the agent can see what’s happening (screen sharing, product demos, whiteboarding)
Multilingual support. Detect the participant’s language and respond in kind

The building blocks are modular. Swap STT providers, replace the LLM, or use a different avatar service. The pipeline structure stays the same.

Take Your Video AI Agent to Production

You’ve seen how the Vonage Video Connector SDK and Pipecat Transport come together to build a real-time AI participant. Production deployments involve more: scaling across sessions, handling edge cases in interruption and network conditions, and integrating with your existing systems.

As a Vonage Development Partner, WebRTC.ventures has deep, hands-on expertise with the Vonage Video API and the broader real-time communication stack. Whether you’re building AI Video Agents for live commerce, virtual tutors, customer support, or telehealth, our team can help you go from prototype to production. Contact us today and let’s take your Video AI Agent to production.
