OpenAI’s new Realtime API represents a significant advancement in AI-powered communication, enabling speech-to-speech conversations with near-instantaneous processing time. The API delivers natural voice interactions through six distinct voice presets, integrating seamlessly with real-time applications.

A few months ago, we showed you how to build an AI-powered WebRTC application from scratch called Polybot.ai. Polly, as we like to call her, is a real-time language translation application using Large Language Models (LLMs). It allows business professionals, travelers, students, patients, and anyone else to get quick, accurate translations on the go, right in the browser.

In the short span of time since we built this app, the technological landscape has evolved considerably. The introduction of OpenAI’s Realtime API brings new capabilities to this space, particularly in optimizing the interaction between AI models and real-time communication systems. 

In this post, we will take a look at the newly released OpenAI Realtime API, the challenges it solves, and how we used it to “go multimodal” with Polybot.ai, pairing it with a Large Multimodal Model (LMM) rather than an LLM.

Why Multimodal?

LLMs allow developers to add innovative features that enrich the user’s experience of real-time communication applications. These include things like in-call assistance, post-call summaries and insights, and as in today’s example: real-time translations. However, implementing such features also introduces its own set of challenges.

When integrating LLM-based features, developers need to perform the following steps:

  1. Obtain audio streams from the device’s microphone.
  2. Process the audio streams using a Speech-To-Text (STT) service like Amazon Transcribe or Symbl.ai Streaming API to get audio transcripts.
  3. Use audio transcripts to build a prompt for the LLM.
  4. Send the prompt to the LLM, and process its response using a Text-To-Speech (TTS) service like Amazon Polly or Deepgram before returning it to the user as audio.

This requires both inputs and outputs to go through multiple hops, adding complexity to the application and increasing the time users have to wait for a response (a.k.a. latency).
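To make these extra hops concrete, here is a minimal sketch of such a pipeline. The transcribe(), askLLM() and synthesize() helpers are hypothetical placeholders for calls to your STT, LLM and TTS providers; the point is simply that every utterance has to traverse three separate services before any audio makes it back to the user.

// hypothetical multi-hop pipeline: STT -> LLM -> TTS
// transcribe(), askLLM() and synthesize() stand in for calls to services
// like Amazon Transcribe, an LLM completion API and Amazon Polly
async function translateUtterance(audioChunk) {
  const transcript = await transcribe(audioChunk);           // hop 1: Speech-To-Text
  const prompt = `Translate the following into Spanish: ${transcript}`;
  const translatedText = await askLLM(prompt);               // hop 2: LLM completion
  const translatedAudio = await synthesize(translatedText);  // hop 3: Text-To-Speech
  return translatedAudio;                                    // played back to the user
}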

Large Multimodal Models (LMMs) solve this issue by allowing developers to ingest audio streams directly into the AI model and obtain an audio response, skipping the intermediate STT and TTS steps altogether.

How Does the Realtime API Work?

OpenAI has made its Realtime API available. This API is designed to enable low-latency, multimodal conversational experiences by integrating text, audio, and function calling capabilities, powered by the GPT-4o LMM.

It operates over a WebSocket connection to facilitate two-way communication between clients (users) and OpenAI servers in real time.

Here’s how it works:

  1. Connection Establishment: The client establishes a persistent WebSocket connection with the server, enabling bi-directional data transfer without needing to reestablish connections for each request/response exchange. This connection is known as a session.
  2. Message Handling: Both clients and servers send messages through this connection as events, such as 'conversation.updated' for receiving new conversation items or 'input_audio_buffer.append' for streaming audio to the server. These messages are formatted as JSON, containing the event type and related data (such as text inputs, audio chunks, etc.). A minimal wire-level sketch is shown after this list.
  3. Event Types: The API supports various types of conversational events such as user input messages, server responses, function calls, and their outputs. For example, when a client sends a 'response.create' message to the server, it triggers a response event with generated speech or text content from the server.
  4. Audio Processing: The Realtime API supports both audio input and output. Clients stream their speech input to the server, and the server responds with generated audio and/or text. The API can also return transcripts of both the user's input audio and the generated responses.
  5. Real-time Interactions: The API operates in real time, ensuring that conversation events are handled as they occur, maintaining low latency. This is crucial for providing a seamless user experience during interactive conversations, avoiding noticeable delays or disruptions.
  6. Server-Side VAD (Voice Activity Detection): The Realtime API can also operate in server-side Voice Activity Detection mode, where it detects when the user has started or stopped talking.
  7. Interruption Handling: The API supports interruptions during real-time conversations. If an ongoing response is interrupted, either by the server or the client, the API can truncate it at an appropriate point and resume the conversation from there. This enables seamless continuation of conversations even when they are momentarily halted.
  8. Function Calls: The Realtime API allows clients to invoke various functions during a conversation, which can include tasks like retrieving information, making decisions, or performing complex computations. These function calls return outputs in real time as part of the ongoing dialogue between the client and server.
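As a reference for how these events look on the wire, here is a minimal sketch of a raw WebSocket session from Node.js using the ws package. The endpoint, model name and event payloads below reflect the beta API at the time of writing and may change; in the rest of this post we will rely on the reference client library instead, which wraps these events for us.

import WebSocket from 'ws';

// connect to the Realtime API (beta) with an API key kept on the server
const url = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01';
const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'OpenAI-Beta': 'realtime=v1',
  },
});

ws.on('open', () => {
  // every message is a JSON event with a "type" field
  ws.send(JSON.stringify({
    type: 'session.update',
    session: { instructions: 'Translate everything the user says into Spanish.' },
  }));
  // ask the model to generate a response for the current conversation
  ws.send(JSON.stringify({ type: 'response.create' }));
});

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());
  console.log('server event:', event.type); // e.g. 'response.audio.delta'
});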

To ease development of products using this API, OpenAI published a reference client library that shows how to interact with it. Additionally, there is a set of helper libraries for managing media in the browser. We will use both of these in the examples in this post, but note that they are in beta and might not be suitable for production applications.

Configuring Session and Server Events Listener

So let’s start our journey into LMM-based features by migrating our Polybot application from using an LLM for its real-time translation capabilities to leveraging the Realtime API instead.

Initializing the Client

The first step is to initialize a RealtimeClient, which receives the OpenAI secret key. For demo purposes we are doing this in the browser, so we also need to pass the dangerouslyAllowAPIKeyInBrowser option set to true. In production applications this should be done server-side.

// the reference client library is installed from openai/openai-realtime-api-beta
import { RealtimeClient } from '@openai/realtime-api-beta';

client = new RealtimeClient({
  apiKey: openAISecret,
  // we need this for the library to work in the browser
  // for production applications this should be done on the server
  dangerouslyAllowAPIKeyInBrowser: true
});

Configuring the Session

Sessions are initialized with a default set of values, but we can configure these using the client’s updateSession method. For Polybot, we are interested in adding a custom instruction that tells the model to translate the user’s input, and in setting turn detection to "server_vad" so the API automatically detects when the user has stopped talking.

client.updateSession({ instructions });
client.updateSession({ turn_detection: { type: "server_vad" }});
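The actual prompt used in Polybot is not shown here, but as an illustration, the instructions could look something like the following, with the target language coming from the user’s selection in the UI:

// hypothetical instructions for the translation use case;
// targetLanguage would come from the language selector in the UI
const instructions = `You are a real-time translator.
Translate everything the user says into ${targetLanguage}
and reply only with the translation, nothing else.`;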

Configuring Server Events Listener

Another important thing to configure is the set of listeners for server events. In particular, you want to handle updates to the conversation through the 'conversation.updated' event, as this is where you’ll get responses from the API.

We do this as follows:

Step 1: Event Listener. Start by listening to the ‘conversation.updated’ event on the client. When an update occurs, it triggers the provided async callback function.

client.on('conversation.updated', async ({item, delta}) => {
  // ...
});

Step 2: Fetching Updated Items. Now retrieve the most recent conversation items from the client. We will use this to update any local record that the application keeps for the conversation.

const _items = client.conversation.getItems();

Step 3: Processing Audio Updates. If there’s an audio update (delta?.audio), add the 16-bit PCM audio data to the wavStreamPlayer, using the item ID as the key. The wavStreamPlayer is an instance of one of the helper libraries mentioned before for managing audio streams in the browser.

if (delta?.audio) {
  wavStreamPlayer.add16BitPCM(delta.audio, item.id);
}

Step 4. Processing Completed Items with Audio. If the updated item is marked as completed and contains audio data (item.status === 'completed' && item.formatted.audio.length), we perform these actions:

4a. If the item is an assistant response (item.role === 'assistant') and it has a non-empty transcript, we display captions using displayCaptions(). This function simply displays the transcript of the response in the UI.

if (item.formatted.transcript && item.role === 'assistant')
  displayCaptions(item.formatted.transcript)

4b. We decode the audio data into a WAV file using the WavRecorder.decode() function, passing input and output sample rates of 24,000 Hz. The decoded WAV file is then assigned to the item’s formatted data as item.formatted.file.

const wavFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
item.formatted.file = wavFile;

The complete code of the event listener is shown below:

client.on('conversation.updated', async ({item, delta}) => {
  const _items = client.conversation.getItems();
  if (delta?.audio) {
    wavStreamPlayer.add16BitPCM(delta.audio, item.id);
  }
  if (item.status === 'completed' && item.formatted.audio.length) {
    if (item.formatted.transcript && item.role === 'assistant')
      displayCaptions(item.formatted.transcript)
    const wavFile = await WavRecorder.decode(
      item.formatted.audio,
      24000,
      24000
    );
    item.formatted.file = wavFile;
  }
  // we update local records of items
  items = _items
});
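While we are wiring up listeners, it is also worth handling the interruptions mentioned earlier. The sketch below follows the pattern used in OpenAI’s reference examples: when the server signals an interruption, we stop local playback and tell the API where the audio was cut off so that it can truncate the response accordingly.

client.on('conversation.interrupted', async () => {
  // stop whatever audio is currently playing and find out where we stopped
  const trackSampleOffset = await wavStreamPlayer.interrupt();
  if (trackSampleOffset?.trackId) {
    const { trackId, offset } = trackSampleOffset;
    // ask the API to truncate the response at the point playback stopped
    await client.cancelResponse(trackId, offset);
  }
});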

Connecting to Sessions

When everything is configured and events are being handled correctly, it’s time to connect to the session and start sending media to the API.
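The wavRecorder and wavStreamPlayer used below are instances of the browser media helpers mentioned earlier. As a minimal sketch, and assuming the wavtools helpers from OpenAI’s reference code, they could be created like this (24 kHz matches the PCM16 audio format used by the API):

// WavRecorder captures microphone input, WavStreamPlayer handles playback
const wavRecorder = new WavRecorder({ sampleRate: 24000 });
const wavStreamPlayer = new WavStreamPlayer({ sampleRate: 24000 });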

await wavRecorder.begin();
await wavStreamPlayer.connect();
await client.connect();

Next, we start listening for audio input using the record method from the WavRecorder helper library, and then send that audio to the API using the appendInputAudio method.

if (client.getTurnDetectionType() === 'server_vad') {
  await wavRecorder.record(data => client.appendInputAudio(data.mono));
}
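If server-side VAD is not enabled, a push-to-talk flow can be used instead: record while the user holds a button, then pause the recorder and explicitly ask the API for a response. A sketch of what that could look like with the same helpers:

// hypothetical push-to-talk handlers for manual turn detection
async function startRecording() {
  await wavRecorder.record((data) => client.appendInputAudio(data.mono));
}

async function stopRecording() {
  await wavRecorder.pause();
  // without server-side VAD, we have to request the response ourselves
  client.createResponse();
}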

Demo Time!

With all the pieces in place, let’s see Polybot’s translation capabilities, now powered by an LMM.

Ready to Unlock Real-Time Multimodal Interactions?

In this example, we’ve demonstrated how to leverage OpenAI’s Realtime API to create a real-time multimodal conversation system for Polybot, our real-time translation application. This seamless integration enables users to communicate in their native language while receiving instant translations, all powered by straightforward logic under the hood, where inputs and outputs flow directly between the application and the AI model.

This approach allows us to build more natural and intuitive interfaces that facilitate deeper connections between people from diverse backgrounds. The future of conversation has arrived: one without borders!

Are you looking to revolutionize the way users interact with your AI-powered real-time communication app? Our team of experts can guide you in incorporating multimodal AI approaches like OpenAI’s Realtime API into your application. By partnering with us, you’ll be able to unlock the full potential of multimodal conversations and gain a competitive edge in the market. Contact us today and let’s make it live!
