Voice AI systems generate more than recordings and transcripts. Every production interaction produces a web of artifacts across multiple systems: call-setup metadata, ASR output, LLM responses, tool calls, CRM updates, escalation events, and compliance-relevant signals like caller identity verification. Most Voice AI architectures store some of these. Few store all of them in a way that survives an audit, a customer dispute, or a platform migration.
That gap has a name and an emerging solution. The vCon standard, currently under active development at the IETF, defines a structured, portable container for conversation data. It gives voice AI and video communications teams a format that can hold the full record of an interaction, not just the transcript, and carry it across systems, trust boundaries, and time.
Voice AI Conversation Records: What Production Systems Actually Capture
Consider a specific scenario: A financial services customer disputes an automated outbound call. Your compliance team needs to answer several questions: What was said? Who initiated the call? Was the caller identity authenticated? What attestation level did STIR/SHAKEN assign? Was the call escalated? What tools or data sources did the AI reference during the interaction?
A transcript answers the first question. It does not answer the rest.
This is the core records problem in production Voice AI. The conversation itself is only one layer. A complete record also needs to capture:
- Who participated, including identity verification signals, not just names or phone numbers
- How the call was established, including SIP signaling metadata and STIR/SHAKEN verification results
- What systems processed the interaction, including ASR providers, LLM orchestration layers, and tool calls
- What analysis was generated, including summaries, classifications, sentiment scores, and AI outputs
- What happened operationally, including escalations, hold events, transfers, and CRM updates
- The audit chain, including timestamps, system identifiers, and signed or encrypted record versions
When these artifacts live in separate systems (an SBC log here, a transcript in an ASR vendor dashboard, a summary in a CRM, a STIR/SHAKEN result in a carrier trace), reconstructing the full record after the fact becomes a manual, error-prone process. In regulated industries, that is a governance problem, not just an operational inconvenience.
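As a rough illustration of that gap, the sketch below checks a collected record against the six layers listed above. The layer names and the record shape are hypothetical, chosen only to mirror that list. They do not come from any standard schema.

```python
# Hypothetical layer names that mirror the list above, not a standard schema.
REQUIRED_LAYERS = (
    "participants",
    "call_setup",
    "processing_systems",
    "analysis",
    "operational_events",
    "audit_chain",
)


def missing_layers(record: dict) -> list[str]:
    """Return the layers that are absent or empty in a collected record."""
    return [layer for layer in REQUIRED_LAYERS if not record.get(layer)]


# A record assembled from a phone number and an ASR transcript only.
transcript_only = {
    "participants": [{"phone": "+15550100"}],
    "analysis": [{"kind": "transcript", "text": "Hello, this is an automated call."}],
}

print(missing_layers(transcript_only))
```

For the transcript-only record, four of the six layers come back as missing. That is the situation a compliance team faces when the transcript is the only artifact that was kept.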
What Is vCon? The IETF Conversation Container Standard
A vCon is a JSON-based conversation container developed under the IETF Virtualized Conversations working group. It is best understood as the conversation equivalent of a vCard: a portable, structured format designed to carry conversation data across systems without losing context.

The core vCon format defines several object types:
- Parties: who participated in the conversation, with support for identity parameters including STIR/SHAKEN data
- Dialog: the actual conversation content, with references to audio, video, or text
- Attachments: related artifacts including SIP messages, certificate chains, verification reports, and supporting files
- Analysis: transcripts, summaries, classifications, sentiment, and AI-generated outputs
- Metadata: timing, identifiers, system context, and audit information
The design principle is separation of concerns. Audio is not the same artifact as a transcript. A transcript is not the same artifact as an identity verification result. A SIP trace is not the same artifact as a compliance report. vCon gives each of these a defined place in one container, rather than forcing everything into a product-specific schema or scattering it across vendor storage.
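To make that separation concrete, here is a minimal sketch of a container built as a plain Python dictionary and serialized to JSON. The field names inside each object (`tel`, `type`, `url`, `vendor`, and so on) are illustrative assumptions, and so are the example values and URL. Check the current vCon draft for the exact schema before relying on any of them.

```python
import json
import uuid
from datetime import datetime, timezone


def new_vcon() -> dict:
    """Create an empty container with one list per object type."""
    return {
        # Metadata: identifier and creation time for the record itself.
        "uuid": str(uuid.uuid4()),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "parties": [],
        "dialog": [],
        "attachments": [],
        "analysis": [],
    }


vcon = new_vcon()

# Parties: who participated. Other objects refer to parties by list index.
vcon["parties"].append({"tel": "+15550100", "name": "Customer"})
vcon["parties"].append({"name": "Voice AI agent"})

# Dialog: the conversation content, here a pointer to stored audio.
vcon["dialog"].append({
    "type": "recording",
    "start": "2025-01-15T14:03:00+00:00",
    "parties": [0, 1],
    "url": "https://media.example.com/calls/abc123.wav",
})

# Analysis: derived output, kept separate from the audio it describes.
vcon["analysis"].append({
    "type": "transcript",
    "dialog": 0,
    "vendor": "example-asr",
    "body": "Hello, I am calling about my account.",
})

# Attachments: supporting artifacts such as a verification report.
vcon["attachments"].append({
    "type": "verification_report",
    "party": 0,
    "body": "identity verified by one-time passcode",
})

print(json.dumps(vcon, indent=2))
```

The transcript points at the dialog entry it was derived from rather than replacing it, so the audio, the transcript, and the verification result each remain a distinct artifact inside one document.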
One architectural detail matters for production pipelines: vCons are designed to evolve over time. Different components of the record can be produced by different systems at different stages. Signed versions become immutable. When additional content needs to be added later, a new vCon references the earlier signed version rather than overwriting it. That versioning model fits real production pipelines, where telephony infrastructure, ASR, LLM orchestration, and post-call analysis all run in separate services on different timelines.
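The versioning idea can be sketched with nothing more than a content digest. In the example below, a later pipeline stage adds a summary by creating a new record that points back at the earlier one, and the earlier one is left untouched. The `previous_version` key and the SHA-256 digest are assumptions made for illustration. They stand in for whatever reference and signing mechanism the vCon specification actually defines.

```python
import copy
import hashlib
import json


def digest(vcon: dict) -> str:
    """SHA-256 over a deterministic JSON encoding of the record."""
    canonical = json.dumps(vcon, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_version(previous: dict, new_uuid: str, analysis_entry: dict) -> dict:
    """Build a new record that adds analysis and points back at `previous`.

    `previous` is never modified. The `previous_version` key is a
    hypothetical name used for illustration.
    """
    updated = copy.deepcopy(previous)
    updated["uuid"] = new_uuid
    updated["previous_version"] = {
        "uuid": previous["uuid"],
        "sha256": digest(previous),
    }
    updated["analysis"].append(analysis_entry)
    return updated


# Version 1 is produced by the telephony layer, version 2 by post-call analysis.
v1 = {"uuid": "v1", "dialog": [{"type": "recording"}], "analysis": []}
v1_digest = digest(v1)

v2 = append_version(v1, "v2", {"type": "summary", "body": "Caller asked about a payment."})

print(digest(v1) == v1_digest)                       # v1 is unchanged
print(v2["previous_version"]["sha256"] == v1_digest)  # v2 pins the exact v1 content
```

Both checks print `True`. Because the reference includes a digest of the earlier record, any later change to that record would be detectable from the newer one.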
Why Conversational Data Gets Fragmented Across Systems
Modern Voice AI deployments are not single systems. A typical production pipeline involves:
- SIP or PSTN infrastructure for call origination and routing
- WebRTC media paths for browser or app-based voice
- ASR services for speech-to-text
- LLM orchestration for reasoning, response generation, and tool use
- TTS services for voice synthesis
- Tool calls to external APIs, databases, or CRMs
- Human escalation paths and agent handoff logic
- Analytics, QA, and evaluation pipelines
- Compliance and retention workflows
Each of these layers generates data about the conversation. Without a common record format, that data accumulates in vendor dashboards, application databases, log aggregators, media storage buckets, and temporary observability pipelines. Much of it has a short retention window by default.
The downstream effect is that basic production questions become difficult to answer reliably:
- What exactly happened in this interaction, end to end?
- What did the AI say, and what data or tools influenced that response?
- Was the caller identity verified, and at what attestation level?
- Was the interaction escalated, and when?
- Can we reconstruct the full record six months from now?
- Can a different system consume this record without losing context?
vCon addresses this by providing a single container that can hold or reference all of these artifacts in a defined, portable structure. The pipeline complexity does not disappear, but the record of what happened becomes coherent rather than fragmented.
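The phrase "hold or reference" covers two storage choices. A small artifact can be embedded in the container, while a large one such as call audio can stay in object storage and be referenced by URL together with a hash of its bytes. The sketch below shows both under assumed key names (`body`, `encoding`, `url`, `content_hash`) and an assumed hash string format. Treat these as illustrative rather than as the normative vCon fields.

```python
import base64
import hashlib


def inline_entry(kind: str, data: bytes) -> dict:
    """Hold a small artifact inside the container as base64url text."""
    return {
        "type": kind,
        "encoding": "base64url",
        "body": base64.urlsafe_b64encode(data).decode("ascii"),
    }


def referenced_entry(kind: str, url: str, data: bytes) -> dict:
    """Reference a large artifact by URL, with a hash of its bytes."""
    return {
        "type": kind,
        "url": url,
        "content_hash": "sha512-" + hashlib.sha512(data).hexdigest(),
    }


def matches(entry: dict, fetched: bytes) -> bool:
    """Check that bytes fetched from the URL are the ones that were recorded."""
    expected = "sha512-" + hashlib.sha512(fetched).hexdigest()
    return entry["content_hash"] == expected


audio = b"fake audio bytes for illustration"
entry = referenced_entry("recording", "https://media.example.com/calls/abc123.wav", audio)
note = inline_entry("crm_note", b"Customer requested a callback.")

print(matches(entry, audio))          # True: the stored file is the recorded one
print(matches(entry, audio + b"x"))   # False: the file changed after recording
```

Storing the hash next to the URL means a record that only references its media can still show, months later, whether the file in storage is the one that was originally captured.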
vCon SIP Signaling and STIR/SHAKEN: What the New IETF Draft Adds
A new IETF Internet-Draft published in April 2026 extends vCon specifically for SIP signaling and STIR/SHAKEN data. This extension is worth attention for any team operating telephony infrastructure.
The problem it addresses is specific: most Voice AI systems capture what was said but discard the evidence of how the call was established. SIP signaling data, including the Call-ID, INVITE and response metadata, and STIR/SHAKEN verification results, typically lives in SBC logs or carrier traces with short retention windows. Once that data ages out, the transcript becomes the only surviving record of the call.
For regulated or high-risk workflows, that matters. STIR/SHAKEN has been required in the IP portions of U.S. voice networks since June 2021. Caller authentication data is now part of mainstream telephony infrastructure. For Voice AI, this is especially relevant to outbound customer interactions, collections, financial services, healthcare, or other workflows where caller identity, consent, traceback, or call legitimacy may later be questioned.
The SIP signaling extension distributes call-setup data across existing vCon objects:
- Party objects can carry sip_contact, sip_user_agent, and sip_display_name
- Dialog objects can carry sip_call_id and related fields
- Attachment objects can store SIP messages, certificate chains, and STIR/SHAKEN verification reports
| Record Type | What You Preserve | What You Lose or Gain |
| --- | --- | --- |
| Voice AI without vCon + SIP | Audio, transcript, summary, agent output | Lose: SIP Call-ID, INVITE metadata, PASSporT, attestation level, certificate chain, verification result |
| Voice AI with vCon + SIP | Audio, transcript, analysis, plus SIP and STIR/SHAKEN evidence in one portable container | Gain: less fragmented evidence, simpler audit reconstruction |
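As a sketch of how call-setup data could be captured before it ages out of SBC logs, the example below pulls a few headers out of a SIP INVITE and places them in the party, dialog, and attachment fragments described above. The `sip_*` field names are the ones named in this article. The sample INVITE, the `sip_message` attachment label, and the simplified header parser are assumptions. A real parser would also need to handle folded headers, compact header forms, and repeated headers.

```python
import re

SIP_INVITE = (
    "INVITE sip:+15550100@carrier.example.com SIP/2.0\r\n"
    'From: "Acme Bank" <sip:+15550199@acme.example.com>;tag=1928301774\r\n'
    "To: <sip:+15550100@carrier.example.com>\r\n"
    "Call-ID: a84b4c76e66710@pbx.acme.example.com\r\n"
    "Contact: <sip:agent@192.0.2.10:5060>\r\n"
    "User-Agent: ExampleSBC/1.0\r\n"
    "\r\n"
)


def parse_headers(message: str) -> dict:
    """Map lower-cased header names to values, skipping the request line."""
    headers = {}
    for line in message.split("\r\n")[1:]:
        if not line:
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def sip_fields(message: str) -> tuple[dict, dict, dict]:
    """Return (party, dialog, attachment) fragments for one INVITE."""
    h = parse_headers(message)
    display = re.match(r'\s*"([^"]*)"', h.get("from", ""))
    party = {
        "sip_contact": h.get("contact", "").strip("<>"),
        "sip_user_agent": h.get("user-agent", ""),
        "sip_display_name": display.group(1) if display else "",
    }
    dialog = {"sip_call_id": h.get("call-id", "")}
    # Keep the raw message as evidence; the type label is hypothetical.
    attachment = {"type": "sip_message", "body": message}
    return party, dialog, attachment


party, dialog, attachment = sip_fields(SIP_INVITE)
print(party)
print(dialog)
```

Keeping the raw message as an attachment alongside the extracted fields means a later reviewer can re-check the parsed values against the original signaling.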
Voice AI Compliance Architecture: What the Record Layer Needs
vCon defines the record format. It still needs to sit inside a production-grade architecture. Teams building toward this should also plan for:
- Object storage for media files with defined retention and cleanup policies
- Managed databases for application and session state
- Authentication, authorization, and tenant-aware access controls
- Signed or encrypted vCon records for tamper-evident audit trails
- Retention policies that satisfy regulatory requirements by vertical
- AI evaluation pipelines that can consume structured vCon data
- Monitoring across media, AI, and application layers
The record format and the surrounding architecture are separate problems. vCon makes the record portable and structured. The architecture around it determines whether that record is actually preserved, secured, and accessible when it matters.
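To illustrate what "tamper-evident" means at the storage layer, the sketch below seals a serialized record with a keyed digest and verifies it later. HMAC with a shared key is used here only as a simple stand-in from the Python standard library. A production deployment would typically use asymmetric signatures so that verifiers do not hold the signing key, and it would follow whatever signing format the vCon specification defines. The function names and the sealed-record shape are hypothetical.

```python
import hashlib
import hmac
import json


def seal(record: dict, key: bytes) -> dict:
    """Serialize the record deterministically and attach a keyed digest."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    tag = hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": canonical, "hmac_sha256": tag}


def verify(sealed: dict, key: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    expected = hmac.new(
        key, sealed["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, sealed["hmac_sha256"])


key = b"demo-key-do-not-use-in-production"
sealed = seal({"uuid": "v1", "analysis": [{"type": "summary", "body": "ok"}]}, key)

# Simulate an edit made to the stored payload after sealing.
tampered = dict(sealed, payload=sealed["payload"].replace("ok", "changed"))

print(verify(sealed, key))    # True
print(verify(tampered, key))  # False
```

Sealing happens once, when a stage finishes writing its version of the record. Verification can then run at read time, during retention sweeps, or when the record is exported to another system.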
Working With WebRTC.ventures on Production Voice AI Architecture
WebRTC.ventures builds production Voice AI systems for regulated and high-volume customer workflows across telehealth, fintech, contact center, and CPaaS platforms. Our work includes real-time media architecture, SIP and WebRTC integration, LLM orchestration, observability design, compliance-aware record keeping, and long-term production support.
The record layer is part of how we design systems from the start, not something we add after a compliance review surfaces a gap. If your team is moving a Voice AI system from prototype to production and wants architecture that holds up under audit, dispute resolution, and long-term operations, we would like to hear about what you are building!
Further Reading:
- Production Voice AI Architecture for Regulated Industries
- Layered AI Guardrails for Enterprise AI Agents (AgilityFeat)
- Bedrock vs Vertex vs LiveKit vs Pipecat: Choosing a Voice AI Agent Production Framework
- Connect Any PSTN Phone Number to a SignalWire Voice AI Agent via SIP Forwarding
- WebRTC SIP Integration: Advanced Techniques for Real-Time Web and Telephony Communication
- Building a Voice AI Agent with Policy Guardrails Using Twilio, Pipecat, and LangGraph
- How to Build Voice AI Applications: A Complete Developer Guide
