OpenAI has introduced a new feature to its Realtime API: a WebRTC endpoint that enables real-time interaction with AI models. This exciting development opens up new possibilities for application builders and users alike, who can now enjoy seamless, instantaneous conversations with AI-powered systems.
In this post, we explore why this new endpoint is important, how it brings together the formerly parallel technologies of WebRTC and AI, and the steps to get started.
From WebSockets to WebRTC
The OpenAI Realtime API enables developers to build voice-interactive applications directly on the GPT-4o and GPT-4o mini models, without intermediate steps. Unlike traditional methods that rely on separate Speech-to-Text and Text-to-Speech models, this API allows the target models to handle voice inputs and outputs natively.
This brings multiple benefits, including lower response latency and the ability to pick up on vocal nuances like intonation, pitch, emotion, and speaking style, which are usually lost in traditional text-based interactions.
Initially, OpenAI made this API available through a WebSocket interface, so any application capable of opening a WebSocket connection could interact with it. In this approach, the application and the API exchange messages that carry chunks of the input and output audio. Each application is responsible for managing the response chunks and playing them to the end user.
This is a great approach for server-side applications. But for browsers, which already bundle support for media streams through the WebRTC API, it isn’t an optimal way to interact with the Realtime API. Also, authenticating with permanent credentials in client-side code, as this interface initially required, is not a good idea.
The new WebRTC endpoint gives browsers, and any other application or device that bundles a WebRTC implementation, a straightforward way to interact with the Realtime API that takes advantage of WebRTC’s built-in networking and media management capabilities.
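To give a sense of the chunk handling this implies, here is a minimal sketch of decoding audio chunks received over the WebSocket interface. It assumes the output audio arrives as base64-encoded 16-bit little-endian PCM inside JSON messages of type `response.audio.delta` with a `delta` field; the event name, field name, and audio format are assumptions to check against the current API reference, and `createChunkCollector` is a hypothetical helper.

```javascript
// Decode one base64-encoded chunk of 16-bit little-endian PCM
// into floating point samples in the range [-1, 1).
function decodePcm16Chunk(base64Audio) {
  const binary = atob(base64Audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(Math.floor(bytes.length / 2));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}

// Hypothetical helper: collects decoded chunks from raw WebSocket messages.
// Assumption: audio arrives in events of type "response.audio.delta".
function createChunkCollector() {
  const chunks = [];
  return {
    chunks,
    handleMessage(rawMessage) {
      const event = JSON.parse(rawMessage);
      if (event.type === "response.audio.delta") {
        chunks.push(decodePcm16Chunk(event.delta));
      }
    },
  };
}
```

The application would still need to queue these samples and schedule playback itself, which is exactly the work a WebRTC media track takes off your hands.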
Filling the Gaps Between WebRTC and AI
While AI has boomed over the last couple of years, at first it had little to do with real-time communication and WebRTC. This changed when we started to see AI models powering voicebots that make and answer calls, avatars that join meetings on our behalf, and note-taking bots that summarize those meetings and provide advanced insights.
Yet even as more use cases combining AI and real-time communication emerged, the two worked side by side rather than as a blended whole. WebRTC enables video and audio interactions, but models such as GPT or Claude were only available through text channels, so developers had to implement intermediate transformation steps that added latency and complexity.
The rise of Large Multimodal Models (LMMs) and the release of the OpenAI Realtime API were a first step toward a better approach. The introduction of this WebRTC endpoint fills the gap between these two technologies. It brings the interaction into a common medium, resulting in more seamless, lower-latency integrations.
For further reading: Real Time Voice AI: OpenAI vs. Open Source Solutions
Getting Started with the WebRTC Endpoint
Interacting with the WebRTC endpoint is a straightforward process: all you need to do is to create a regular RTCPeerConnection, and generate an SDP Offer to exchange with the API. Right now the signaling channel is a REST API endpoint where you send such an offer in order to get the SDP answer.
For authentication, you can generate ephemeral keys in your backend or server-side code and then use them in the browser to interact with the API. These keys are valid for up to 1 minute and are only used to establish the initial connection.
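Because the key is so short-lived, it can help to check that it is still usable right before starting the connection. The sketch below assumes the session response carries a `client_secret` object with a `value` string and an `expires_at` Unix timestamp in seconds; the `expires_at` field and its unit are assumptions to verify against the API reference, and `isEphemeralKeyUsable` is a hypothetical helper.

```javascript
// Returns true when the ephemeral key should still be valid after
// `safetyMarginMs` milliseconds, leaving time for the SDP exchange.
// Assumption: clientSecret.expires_at is a Unix timestamp in seconds.
function isEphemeralKeyUsable(clientSecret, nowMs = Date.now(), safetyMarginMs = 5000) {
  if (!clientSecret || typeof clientSecret.value !== "string") {
    return false;
  }
  const expiresAtMs = clientSecret.expires_at * 1000;
  return Number.isFinite(expiresAtMs) && nowMs + safetyMarginMs < expiresAtMs;
}
```

If the check fails, the client can simply request a fresh key from the backend before creating the offer.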
As of writing this post, the API doesn’t seem to support, or at least to use, ICE servers so it might not work on some restrictive networks, and might not handle changes in network conditions properly. It also seems to lack support for trickle ICE. Stay tuned for an upcoming post giving a deeper look into the implementation, as we expect these concerns to be addressed as the API evolves.
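One way to see which kinds of ICE candidates are in play is to inspect the candidate lines of a session description, for example `pc.localDescription.sdp` or the answer returned by the API. Each `a=candidate` line carries a `typ` field, and local server-reflexive (`srflx`) or relay (`relay`) candidates only appear when STUN or TURN servers are configured. The following is a minimal sketch; `countCandidateTypes` is a hypothetical helper, not part of any API.

```javascript
// Count ICE candidates in an SDP blob, grouped by candidate type.
function countCandidateTypes(sdp) {
  const counts = { host: 0, srflx: 0, prflx: 0, relay: 0 };
  for (const line of sdp.split(/\r?\n/)) {
    if (!line.startsWith("a=candidate:")) continue;
    const match = line.match(/ typ (host|srflx|prflx|relay)\b/);
    if (match) counts[match[1]] += 1;
  }
  return counts;
}
```

A description that only ever shows `host` candidates suggests that connectivity depends on a direct path between the client and the server, which is where restrictive networks can get in the way.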
To summarize, the process for establishing a WebRTC connection with the OpenAI Realtime API goes like this:
- Generate an ephemeral key on your backend.
- Create an RTCPeerConnection and generate an SDP Offer.
- Send the Offer to the OpenAI servers to receive the SDP Answer.
- Add the Answer to the peer connection.
- Enjoy!
// server.js
// Assumes an Express app on Node 18+ (which provides a built-in fetch)
import express from "express";

const app = express();

// 1. Generate an ephemeral key on your backend
app.get("/session", async (req, res) => {
  const r = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview-2024-12-17",
      voice: "verse",
    }),
  });
  const data = await r.json();
  // Send back the JSON we received from the OpenAI REST API
  res.send(data);
});

// The port is an arbitrary choice for this example
app.listen(3000);
// client.js
// Uses top-level await: load it as an ES module or wrap it in an async function
// 2. Create an RTCPeerConnection and generate an SDP Offer
const pc = new RTCPeerConnection();

// Play the model's audio as soon as the remote track arrives
const audioEl = document.createElement("audio");
audioEl.autoplay = true;
pc.ontrack = (event) => {
  audioEl.srcObject = event.streams[0];
};

// Send the user's microphone audio to the model
const localStream = await navigator.mediaDevices.getUserMedia({ audio: true });
localStream.getTracks().forEach((track) => pc.addTrack(track, localStream));

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

// 3. Send the Offer to the OpenAI servers to receive the SDP Answer
const tokenResponse = await fetch("/session");
const data = await tokenResponse.json();
const EPHEMERAL_KEY = data.client_secret.value;
const baseUrl = "https://api.openai.com/v1/realtime";
const model = "gpt-4o-realtime-preview-2024-12-17";
const sdpResponse = await fetch(`${baseUrl}?model=${model}`, {
  method: "POST",
  body: offer.sdp,
  headers: {
    Authorization: `Bearer ${EPHEMERAL_KEY}`,
    "Content-Type": "application/sdp",
  },
});

// 4. Add the Answer to the peer connection
const answer = {
  type: "answer",
  sdp: await sdpResponse.text(),
};
await pc.setRemoteDescription(answer);
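The snippet above passes whatever the server returns straight to the peer connection, so an expired key or a rejected request would surface later as a confusing SDP parsing error. Below is a sketch of steps 3 and 4 with basic error handling added. The helper names `buildSdpRequest` and `exchangeSdp` are hypothetical, and the injectable `fetchImpl` parameter is only there to make the function easy to test.

```javascript
const REALTIME_URL = "https://api.openai.com/v1/realtime";

// Build the URL and fetch options for the SDP exchange.
function buildSdpRequest(offerSdp, ephemeralKey, model) {
  const url = `${REALTIME_URL}?${new URLSearchParams({ model })}`;
  return {
    url,
    init: {
      method: "POST",
      body: offerSdp,
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
    },
  };
}

// Send the offer and return an answer description, or throw a
// descriptive error when the request fails or the body is not SDP.
async function exchangeSdp(offerSdp, ephemeralKey, model, fetchImpl = fetch) {
  const { url, init } = buildSdpRequest(offerSdp, ephemeralKey, model);
  const response = await fetchImpl(url, init);
  const text = await response.text();
  if (!response.ok) {
    throw new Error(`SDP exchange failed with HTTP ${response.status}: ${text}`);
  }
  if (!text.startsWith("v=0")) {
    throw new Error("Response body does not look like an SDP answer");
  }
  return { type: "answer", sdp: text };
}
```

With this in place, the last step becomes `await pc.setRemoteDescription(await exchangeSdp(offer.sdp, EPHEMERAL_KEY, model))`, and failures can be caught and reported to the user.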
Enabling Seamless AI-based Voice-to-Voice Interactions Using WebRTC – Ready to bring your vision of real-time AI interactions to life?
In summary, OpenAI’s new WebRTC endpoint for the Realtime API has opened up exciting possibilities for real-time AI interactions. By leveraging this technology, developers can build applications that integrate with AI models like GPT-4o with less latency and complexity, enabling instant, effortless communication.
With OpenAI’s new WebRTC endpoint for the Realtime API, you’re just one step away from creating seamless, instantaneous conversations with AI-powered systems. To help you turn this vision into a reality, consider partnering with WebRTC.ventures – experts in building custom WebRTC solutions that integrate perfectly with AI systems like GPT-4. Let our team of experienced developers guide you through the process and ensure your application is built to meet the highest standards of performance, security, and scalability. Contact Us today, and Let’s Make it Live!