Vonage recently released the Video Connector SDK and a companion Pipecat Transport. These new tools let a server-side AI agent join a Vonage Video session as a full participant, with real-time audio and video. Together they make it possible to build an AI Avatar that doesn’t just chat, but actually listens and speaks inside a live video call.
That shift matters because most AI interaction today is still text-based — a chatbot as the first point of contact in a customer service flow. Video changes the shape of that interaction entirely. It’s like the jump from radio to television, except this time the TV talks back.
The use cases are already taking shape: live commerce co-hosts that demo products and answer questions in real time, virtual tutors that adapt to a student’s facial expressions, customer support avatars that feel more human than a hold queue, and telehealth assistants that can triage before a doctor joins.
In this post, you’ll use the Vonage Video Connector SDK and Pipecat Transport to build a Video AI agent that joins a Vonage Video session with:
- Real-time speech-to-text using Deepgram
- LLM-powered reasoning through Amazon Bedrock
- Natural text-to-speech from ElevenLabs
- A lip-synced talking avatar from Simli
All of it orchestrated with Pipecat, an open-source framework by Daily for building voice and multimodal AI agents. By the end, you’ll have a working AI participant that joins a video call, listens, thinks, and responds with a realistic avatar.
Prerequisites
Before you start, make sure you have the following:
- Python 3.13+ — the Vonage Video Connector SDK native library requires Python 3.13 on Linux AMD64 or ARM64
- A Vonage Video API account — sign up here and create a Video API application in the dashboard
- AWS credentials with access to Amazon Bedrock (Claude Haiku model enabled in your region)
- A Deepgram API key — sign up for free for speech-to-text
- An ElevenLabs API key and voice ID — create an account for text-to-speech
- A Simli API key and face ID — sign up at Simli for the lip-synced avatar
Install the project dependencies:
pip install "pipecat_ai[deepgram,elevenlabs,aws,simli,vonage-video-connector]\
@ https://github.com/Vonage/pipecat/releases/download/v1.3.0.post1/pipecat_ai-1.3.0.post1-py3-none-any.whl"
pip install vonage-video aiohttp pydantic python-dotenv boto3
Create a .env file with your credentials:
# Vonage Video API
VONAGE_APPLICATION_ID=your_app_id
VONAGE_PRIVATE_KEY_PATH=private.key
# AWS (for Bedrock)
AWS_REGION=us-east-1
# Deepgram STT
DEEPGRAM_API_KEY=your_deepgram_key
# ElevenLabs TTS
ELEVENLABS_API_KEY=your_elevenlabs_key
ELEVENLABS_VOICE_ID=your_voice_id
# Simli Avatar
SIMLI_API_KEY=your_simli_key
SIMLI_FACE_ID=your_face_id
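A missing key tends to surface as a confusing error halfway through a call, so it can help to check the environment up front. The following is a minimal sketch, not part of the original project: the helper names missing_env_vars and check_env are hypothetical, and the variable list simply mirrors the .env example above (AWS_REGION is left out because the agent code falls back to a default).

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = [
    "VONAGE_APPLICATION_ID",
    "VONAGE_PRIVATE_KEY_PATH",
    "DEEPGRAM_API_KEY",
    "ELEVENLABS_API_KEY",
    "ELEVENLABS_VOICE_ID",
    "SIMLI_API_KEY",
    "SIMLI_FACE_ID",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def check_env(env=None):
    """Raise early with one readable message instead of failing mid-call."""
    missing = missing_env_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

If you keep the values in a .env file, calling load_dotenv() from python-dotenv before check_env() is one way to get them into the process environment.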
Architecture: How a Video AI Agent Pipeline Works End-to-End
Here’s the high-level data flow:
Browser (Participant) ↔ Vonage Video SFU ↔ Pipecat Pipeline (AI Agent)
The participant’s browser connects to a Vonage Video session via WebRTC. The AI agent also connects to that same session using the Vonage Video Transport for Pipecat. Audio from the participant flows into a Pipecat pipeline, gets transcribed, processed by an LLM, converted to speech, rendered as avatar video, and sent back into the session.

This is what makes it a real-time voice AI pipeline. Each stage in the pipeline processes data as streaming frames. Audio comes in continuously, gets transcribed word by word, the LLM streams its response token by token, TTS generates audio in chunks, and the avatar renders video frame by frame. This streaming approach keeps latency low and makes the conversation feel natural.
Here’s the pipeline definition in code:
pipeline = Pipeline(
    [
        transport.input(),  # Audio from Vonage session participants
        stt,  # Deepgram speech-to-text
        context_aggregator.user(),  # Accumulate user context
        llm,  # Bedrock Claude generates response
        tts,  # ElevenLabs text-to-speech
        simli,  # Simli generates lip-synced avatar video
        transport.output(),  # Send avatar audio + video to Vonage session
        context_aggregator.assistant(),
    ]
)
Pipecat handles the streaming, buffering, interruption detection, and synchronization between services.
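To make the streaming idea concrete, here’s a toy sketch that uses only the Python standard library. It is not how Pipecat is implemented (its frame processors work differently internally), and every function name here is hypothetical. What it shows is the interleaving: the first chunk comes out of the last stage before the second chunk has entered the first one.

```python
import asyncio


async def transcribe(audio_chunks):
    """Stand-in for STT: emit one word per incoming audio chunk."""
    async for chunk in audio_chunks:
        yield chunk.decode()


async def respond(words):
    """Stand-in for the LLM: stream one token per word, as it arrives."""
    async for word in words:
        yield word.upper()


async def synthesize(tokens):
    """Stand-in for TTS: turn each token into an audio chunk immediately."""
    async for token in tokens:
        yield token.encode()


async def microphone(log):
    """Fake audio source that logs when each chunk enters the pipeline."""
    for chunk in (b"hello", b"there"):
        log.append(("in", chunk))
        yield chunk
        await asyncio.sleep(0)


async def run_demo():
    """Chain the stages and record when chunks enter and leave."""
    log = []
    async for audio in synthesize(respond(transcribe(microphone(log)))):
        log.append(("out", audio))
    return log


print(asyncio.run(run_demo()))
```

The log alternates between "in" and "out" entries rather than listing all inputs first, which is the property that keeps perceived latency low in a real pipeline.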
The Vonage Video Transport for Pipecat: Connecting WebRTC to Your AI Pipeline
The Vonage Video Transport for Pipecat is the official transport that bridges Vonage Video sessions with Pipecat’s frame-based pipeline. It handles all the complexity of WebRTC connectivity, audio/video format conversion, stream subscription, and session lifecycle — so you can focus on building your AI logic.
Under the hood, the transport uses the Vonage Video Connector SDK, a native library that manages the WebRTC connection. The transport wraps this in Pipecat’s BaseInputTransport and BaseOutputTransport interfaces, handling the thread-safe bridging between native C callbacks and Python’s asyncio event loop automatically.
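If you’re curious what that kind of bridging involves, here’s a minimal standard-library sketch of the general technique. It is an illustration under assumptions, not the transport’s actual code: a plain Python thread stands in for the native library, and start_native_callbacks and collect_audio are hypothetical names. The key call is loop.call_soon_threadsafe, which hands data from a foreign thread to the event loop safely.

```python
import asyncio
import threading


def start_native_callbacks(on_audio, chunks):
    """Stand-in for a native library invoking a callback from its own thread."""
    def worker():
        for chunk in chunks:
            on_audio(chunk)
        on_audio(None)  # sentinel: no more audio

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


async def collect_audio(chunks):
    """Receive chunks produced on another thread inside the event loop."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    def on_audio(chunk):
        # Runs on the worker thread. asyncio.Queue is not thread-safe, so
        # hand the put over to the event loop thread instead of calling it here.
        loop.call_soon_threadsafe(queue.put_nowait, chunk)

    thread = start_native_callbacks(on_audio, chunks)
    received = []
    while (chunk := await queue.get()) is not None:
        received.append(chunk)
    thread.join()  # the worker has already finished once the sentinel arrives
    return received


print(asyncio.run(collect_audio([b"a", b"b", b"c"])))
```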
Setting Up the Transport
Install the transport as a Pipecat extra:
pip install "pipecat-ai[vonage-video-connector]"
Then configure it in your agent:
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.vonage.video_connector import (
    VonageVideoConnectorTransport,
    VonageVideoConnectorTransportParams,
)

transport = VonageVideoConnectorTransport(
    application_id=application_id,
    session_id=session_id,
    token=token,
    params=VonageVideoConnectorTransportParams(
        audio_in_enabled=True,
        audio_in_sample_rate=16000,
        audio_in_channels=1,
        audio_out_enabled=True,
        audio_out_sample_rate=16000,
        audio_out_channels=1,
        video_out_enabled=True,
        video_out_width=512,
        video_out_height=512,
        video_out_color_format="RGB",
        publisher_name="AI Agent",
        audio_in_auto_subscribe=True,
        vad_analyzer=SileroVADAnalyzer(),
        clear_buffers_on_interruption=True,
    ),
)
A few things to note about the configuration:
- Automatic stream subscription. With audio_in_auto_subscribe=True, the transport automatically subscribes to audio from any participant that joins the session. No manual callback wiring is needed: when someone speaks, their audio flows into your pipeline.
- Video format handling. You specify video_out_color_format="RGB" and the transport handles the pixel format conversion internally. Pipecat’s Simli service outputs RGB frames, and Vonage expects ARGB32; the transport takes care of this without you needing to write any conversion code.
- Voice Activity Detection. The SileroVADAnalyzer detects when the participant is speaking vs. silent. This optimizes processing by only sending audio to the STT service when there’s actual speech, and it enables interruption detection.
- Buffer clearing on interruption. When clear_buffers_on_interruption=True, if the user starts speaking while the agent is mid-response, any queued audio/video that hasn’t been sent yet is discarded. This makes the agent stop talking immediately rather than finishing its current sentence.
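For a feel of the data volumes behind these settings, here’s some back-of-the-envelope arithmetic. The sample rate, channel count, and frame dimensions come from the params above; the 16-bit sample size and the 20 ms chunk length are assumptions made for illustration, and both helper functions are hypothetical.

```python
def pcm_chunk_bytes(sample_rate, channels, chunk_ms, bytes_per_sample=2):
    """Size of one PCM audio chunk. 16-bit samples are an assumption."""
    samples = sample_rate * chunk_ms // 1000
    return samples * channels * bytes_per_sample


def video_frame_bytes(width, height, bytes_per_pixel):
    """Size of one uncompressed video frame."""
    return width * height * bytes_per_pixel


# Values from the transport params above; the 20 ms chunk length is hypothetical.
audio_chunk = pcm_chunk_bytes(sample_rate=16000, channels=1, chunk_ms=20)
rgb_frame = video_frame_bytes(512, 512, bytes_per_pixel=3)   # RGB
argb_frame = video_frame_bytes(512, 512, bytes_per_pixel=4)  # ARGB32

print(audio_chunk, rgb_frame, argb_frame)  # 640 786432 1048576
```

Under those assumptions, an uncompressed avatar frame is over a thousand times larger than an audio chunk, which is one reason discarding queued video on interruption matters.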
Session Lifecycle
Unlike lower-level approaches where you manually manage connect/disconnect calls, the official transport integrates its lifecycle with Pipecat’s PipelineRunner. You just call runner.run(task) and the transport handles connecting to the session, publishing, subscribing to streams, and disconnecting when the pipeline ends:
runner = PipelineRunner()
await runner.run(task)
Event Handlers
The transport exposes session events that let you react to participants joining or leaving:
import asyncio

from pipecat.frames.frames import TTSSpeakFrame

@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, data):
    # Greet when the first participant connects
    await asyncio.sleep(2.0)
    await task.queue_frames([TTSSpeakFrame("Hello! What can I help you with today?")])
Available events include on_joined, on_left, on_first_participant_joined, on_participant_joined, on_participant_left, on_client_connected, and on_client_disconnected. These give you hooks into the full session lifecycle without needing to manage raw SDK callbacks.
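If the decorator style is unfamiliar, here’s a toy sketch of how a named-event registry of this shape can work. It is not Pipecat’s implementation; EventEmitter, emit, and the {"name": ...} payload are all hypothetical and exist only to show the registration and dispatch pattern.

```python
import asyncio
from collections import defaultdict


class EventEmitter:
    """Toy registry of async handlers keyed by event name."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def event_handler(self, name):
        """Decorator that registers an async handler for a named event."""
        def decorator(func):
            self._handlers[name].append(func)
            return func
        return decorator

    async def emit(self, name, *args):
        """Await every handler registered for the event, in order."""
        for handler in self._handlers[name]:
            await handler(self, *args)


async def demo():
    emitter = EventEmitter()
    greeted = []

    @emitter.event_handler("on_first_participant_joined")
    async def greet(source, data):
        greeted.append(data["name"])  # hypothetical payload shape

    await emitter.emit("on_first_participant_joined", {"name": "alice"})
    await emitter.emit("on_participant_left", {"name": "alice"})  # no handlers
    return greeted


print(asyncio.run(demo()))
```

Events without a registered handler are simply ignored, so you only wire up the hooks you care about.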
Wiring Up AI Services: STT, LLM, TTS, and Avatar Generation
With the transport in place, configuring the AI services is straightforward. Each one plugs into the pipeline as a processor that consumes frames from the previous stage and produces frames for the next:
# Speech-to-Text
stt = DeepgramSTTService(api_key=os.environ["DEEPGRAM_API_KEY"])

# LLM (AWS Bedrock Claude)
llm = AWSBedrockLLMService(
    aws_region=os.environ.get("AWS_REGION", "us-east-1"),
    settings=AWSBedrockLLMService.Settings(
        model=os.environ.get("BEDROCK_MODEL_ID", "us.anthropic.claude-haiku-4-5-20251001-v1:0"),
        max_tokens=100,
    ),
)

# Text-to-Speech
tts = ElevenLabsTTSService(
    api_key=os.environ["ELEVENLABS_API_KEY"],
    settings=ElevenLabsTTSService.Settings(voice=os.environ["ELEVENLABS_VOICE_ID"]),
)

# Lip-synced Avatar
simli = SimliVideoService(
    api_key=os.environ["SIMLI_API_KEY"],
    face_id=os.environ["SIMLI_FACE_ID"],
)
A few things worth noting:
- Streaming, not batching. Pipecat doesn’t wait for the LLM to finish its entire response before starting TTS. As soon as the first tokens arrive, they flow into the TTS service. As soon as audio chunks come back, they flow into the avatar rendering service. This pipelining keeps end-to-end latency tight.
- Interruption handling. If the user starts speaking while the agent is mid-response, Pipecat detects this and cancels the current generation. The agent stops talking and starts listening again. This is configured with a single parameter:
task = PipelineTask(
    pipeline,
    params=PipelineParams(allow_interruptions=True),
)
- Context management. The LLMContextAggregatorPair keeps track of the conversation history, accumulating user messages and assistant responses so the LLM has full context for each turn.
# --- Context / Aggregators ---
context = LLMContext(
    messages=[{"role": "system", "content": system_instruction}],
)
context_aggregator = LLMContextAggregatorPair(context)
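The interruption behavior described above can be pictured with plain asyncio task cancellation. This is a conceptual sketch only, not what Pipecat does internally: speak and demo_interruption are hypothetical names, and the "user spoke" signal is simulated by firing an event while the second chunk is playing.

```python
import asyncio


async def speak(chunks, on_chunk):
    """Stand-in for the agent's response: 'play' one chunk per loop tick."""
    for chunk in chunks:
        on_chunk(chunk)
        await asyncio.sleep(0)  # yield control so an interruption can land


async def demo_interruption():
    played = []
    user_spoke = asyncio.Event()

    def on_chunk(chunk):
        played.append(chunk)
        if chunk == "two":
            user_spoke.set()  # simulate the user talking over chunk "two"

    response = asyncio.create_task(
        speak(["one", "two", "three", "four"], on_chunk)
    )

    await user_spoke.wait()  # in a real agent, VAD would raise this signal
    response.cancel()        # drop whatever has not been played yet
    try:
        await response
    except asyncio.CancelledError:
        pass
    return played


print(asyncio.run(demo_interruption()))  # ['one', 'two']
```

Chunks "three" and "four" never play, which mirrors the agent going quiet as soon as the participant starts talking.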
Demo Time
The end result is an AI Video Agent that can interact with customers, provided it has appropriate context and the right Voice AI system behind it for the use case. Here’s what the finished AI avatar video call looks like in action:
Extending Your Video AI Agent
The Vonage Video Transport for Pipecat makes it simple to drop an AI participant into any video session. The official transport handles all the complexity of WebRTC connectivity, native thread bridging, audio/video format conversion, and stream subscription while Pipecat manages the streaming, interruption, and service orchestration.
From here, you could extend this in several directions:
- Multi-agent sessions. Multiple AI participants with different roles in the same call
- Tool use. Give the agent function-calling capabilities to look up data, book appointments, or control external systems
- Vision input. Process the participant’s video feed so the agent can see what’s happening (screen sharing, product demos, whiteboarding)
- Multilingual support. Detect the participant’s language and respond in kind
The building blocks are modular. Swap STT providers, replace the LLM, or use a different avatar service. The pipeline structure stays the same.
Take Your Video AI Agent to Production
You’ve seen how the Vonage Video Connector SDK and Pipecat Transport come together to build a real-time AI participant. Production deployments involve more: scaling across sessions, handling edge cases in interruption and network conditions, and integrating with your existing systems.
As a Vonage Development Partner, WebRTC.ventures has deep, hands-on expertise with the Vonage Video API and the broader real-time communication stack. Whether you’re building AI Video Agents for live commerce, virtual tutors, customer support, or telehealth, our team can help you go from prototype to production. Contact us today and let’s take your Video AI Agent to production.

