Latency is a crucial factor in real-time communication applications like video conferencing, online gaming, and live streaming. Even small delays can disrupt the flow of communication or interaction, leading to frustration and a poor user experience. As interactive and immersive technologies continue to evolve, managing latency becomes even more important: keeping it low is key to ensuring these experiences feel as seamless, responsive, and engaging as users expect.

In this post, we will delve into the concept of latency in real-time communication applications, explore its various factors and components, and discuss strategies for reducing and managing it effectively.

Understanding Latency and Latency Values

In general terms, latency is the time you wait for something to happen after an initial action. For instance, in music production it is the time that passes between a musician playing a note and the sound coming out of the computer’s speaker; similarly, in online video games it is the time that passes between pressing a button on the controller and the character moving on the screen.

In the context of real-time communication applications, latency is the time it takes for data to travel from one user to another and back, essentially determining how long users wait to receive a response. Lower latency ensures that communication feels immediate and seamless, while higher latency can cause noticeable delays, leading to a lag in conversations or interactions.

For real-time communication applications, latency below 500 milliseconds is crucial to avoid interruptions and maintain a smooth, natural flow. Anything beyond that can make conversations feel awkward and impair the user experience, as it introduces noticeable gaps between speaking and hearing, or between action and response.

There are multiple processes that contribute to the overall latency value. Each of these steps takes microseconds or milliseconds, but combined, they start to add up.

  • Acquiring media data from microphones and cameras
  • Encoding and decoding media
  • Processing media (e.g., mixing, noise suppression, or other transformations)
  • Transmitting data through the network

There are other factors that influence latency, both directly and indirectly, such as:

  • Network conditions
  • Geographical locations of the peers and servers
  • The encoders and decoders (codecs) that are being used
  • Device and hardware limitations

Acceptable Latency Varies By Use Case

For use cases like cloud gaming or live video auctions, low latency is absolutely critical. In these scenarios, even a few milliseconds of delay can significantly impact the user’s experience, leading to missed opportunities or lag that makes the experience frustrating or unplayable.

While latencies under 500 ms may be considered acceptable in some applications, they are often too high for the demands of the use cases mentioned above, where near-instantaneous responses are needed. For these use cases, latency closer to 20-50 ms is generally considered ideal, and anything above 100 ms is likely to feel sluggish or unresponsive.

In contrast, there are communication scenarios where users don’t need a sub-second reaction. These include:

  • Chat
  • Video-on-demand (VOD) streaming
  • Non-interactive streaming/broadcasting

How we manage latency depends on the target performance requirements and the thresholds needed for a particular use case.

How Generative AI Increases Latency

Generative AI refers to a group of AI models designed to create new content based on given instructions. At the forefront of this technology are Large Language Models (LLMs), which process natural language inputs to generate text responses or interact with external systems. LLMs are rapidly becoming essential in real-time communication applications.

AI-based applications like these require additional steps in the media pipeline, such as the ones listed below (a minimal sketch follows the list). Each of these steps has a direct impact on latency.

  1. Original audio stream goes to an Automatic Speech Recognition (ASR) system that transcribes the speech into text.
  2. The transcript goes into one or more AI models which generate an appropriate response.
  3. The response goes into a Text-to-Speech (TTS) system that speaks it back to the user.
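
To make the flow concrete, here is a minimal sketch of this pipeline. The transcribe, generateResponse, synthesizeSpeech, and playAudio functions are hypothetical wrappers around your ASR, LLM, TTS, and audio-output components; note that each await adds its own latency on top of the media pipeline itself.

async function handleUtterance(audioChunk) {
  // 1. ASR: convert the user's speech into text
  const transcript = await transcribe(audioChunk);

  // 2. LLM: generate a response from the transcript
  const responseText = await generateResponse(transcript);

  // 3. TTS: synthesize the response and play it back to the user
  const responseAudio = await synthesizeSpeech(responseText);
  playAudio(responseAudio);
}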

As we continue discussing latency, it’s crucial to consider how these additional processes impact our latency goals and take steps to minimize their effects.

Monitoring Latency

Implementing efficient monitoring gives you visibility over the state of your application and lets you know exactly what needs to be optimized.

Among the metrics you need to measure are:

  • Round Trip Time (RTT): Measures the time taken for a packet to travel from source to destination and back. It gives a good approximation of the total end-to-end latency.
  • Bitrate: Measures the amount of data transmitted per second.
  • Jitter: Represents the variation in packet arrival time.
  • Packet Loss: Tracks the percentage of lost data packets.
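
In a WebRTC application, most of these metrics can be read directly from the browser’s statistics API. The sketch below assumes an already-connected RTCPeerConnection named peerConnection; bitrate is omitted, since it requires comparing bytesReceived between two samples.

// Read RTT, jitter, and packet loss from a live RTCPeerConnection
async function logLatencyMetrics(pc) {
  const stats = await pc.getStats();
  stats.forEach((report) => {
    if (report.type === 'candidate-pair' && report.state === 'succeeded') {
      // currentRoundTripTime is reported in seconds
      console.log(`RTT: ${report.currentRoundTripTime * 1000} ms`);
    }
    if (report.type === 'inbound-rtp') {
      console.log(`Jitter: ${report.jitter * 1000} ms`);
      console.log(`Packets lost: ${report.packetsLost}`);
    }
  });
}

// Sample periodically, e.g. every 5 seconds
setInterval(() => logLatencyMetrics(peerConnection), 5000);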

It’s also helpful to add benchmarks around key processes in the code, such as any custom transformation pipeline or interactions with other application components. The goal is to know exactly what is adding latency to your application so you can improve it.
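
A simple helper built on performance.now() is often enough for this; in the sketch below, suppressNoise and frame are hypothetical stand-ins for one of your own processing steps.

// Time an arbitrary (possibly async) step and log the result
async function timed(label, fn) {
  const start = performance.now();
  const result = await fn();
  console.log(`${label} took ${(performance.now() - start).toFixed(1)} ms`);
  return result;
}

// Example usage inside your pipeline:
// const clean = await timed('noise suppression', () => suppressNoise(frame));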

Reducing Latency for Real-Time Communication

After setting a realistic latency target for your use case, the next step is to design an effective strategy—or combination of strategies—for managing it.

Keep Your Infrastructure Close

The most obvious countermeasure for reducing latency is to shorten the distance between peers. This means positioning media and ICE servers as close as possible to the users and to each other.
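
On the client side, this can be as simple as pointing each user at the nearest servers. In the sketch below the hostnames are hypothetical, and the region would typically come from a geolocation or latency probe.

// Connect to the STUN/TURN servers in the region closest to the user
const region = 'us-east'; // e.g., chosen via a geolocation lookup
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: `stun:stun.${region}.example.com:3478` },
    {
      urls: `turn:turn.${region}.example.com:3478`,
      username: 'user',
      credential: 'secret',
    },
  ],
});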

The same principle applies to use cases involving interaction with AI models. When possible, it’s beneficial to have all required components—such as ASR systems, LLMs, and TTS services—within the same network, or even running on the same device, to minimize the latency caused by transporting data between different points.

Optimize Processes

After identifying the latency bottlenecks in your application, the next step is to optimize these processes as much as possible.

For instance, let’s say your application implements a custom WebRTC connection setup with multiple video streams, using an approach similar to the one shown in the code below.

async joinRoom(roomId, participants) {
  await this.initializeLocalStream();

  // establish peer connections one at a time; each await blocks the next
  for (const participantId of participants) {
    await this.connectToPeer(participantId);
  }
}

After a closer analysis, you determine that this approach establishes connections one at a time: each connection must complete before the next one starts, which adds considerable latency when users first join a call.

Instead, you can establish the connections in parallel using Promise.all (or Promise.allSettled, if a single failed connection shouldn’t block the rest), as follows:

async joinRoom(roomId, participants) {
  await this.initializeLocalStream();

  // parallel connection establishment using Promise.all
  const connectionPromises = participants.map((id) => this.connectToPeer(id));
  await Promise.all(connectionPromises);

  // once connected, start collecting the latency metrics described above
  this.startNetworkMonitoring();
}

When integrating GenAI into your media pipeline, it’s crucial to consider the LLM’s token-generation rate, which reflects how quickly the model can generate responses. While technological advances are reducing response times, they remain relatively high. This poses a challenge, especially when compared to natural face-to-face conversations. 

A study on Timing in Conversation shows that speakers typically start responding within 200 milliseconds after another speaker finishes. Delays longer than this can make interactions feel less smooth or synchronized, highlighting the need for additional strategies to manage response lag in AI-driven communication systems.

One effective approach to mitigate this is to stream the LLM’s response as it is being generated. This lets users start receiving feedback without waiting for the entire response, improving the fluidity of the interaction.
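
As a sketch, assuming the official openai Node SDK (v4+), which exposes streaming completions as an async iterable, and a hypothetical sendToTTS function that synthesizes audio incrementally:

import OpenAI from 'openai';

const openai = new OpenAI();

// Hand each chunk of the response to TTS as it arrives, instead of
// waiting for the full completion
async function streamResponse(transcript) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // any streaming-capable model
    messages: [{ role: 'user', content: transcript }],
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) {
      sendToTTS(delta); // hypothetical incremental TTS wrapper
    }
  }
}

In practice, you would usually buffer the stream to sentence or phrase boundaries before handing it to TTS, so the synthesized speech sounds natural rather than choppy.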

Pick Latency-Efficient Tools

In addition to optimizing the processes in your application, you also need to select the right tools. For instance, if your application requirements allow it, you can force the use of latency-efficient codecs such as Opus for audio and VP8 for video.
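
In browsers, one way to enforce this is RTCRtpTransceiver.setCodecPreferences, which reorders the codec list before negotiation. The sketch below assumes pc is an RTCPeerConnection that has not yet created an offer; the same pattern works for audio with 'audio/opus'.

// Prefer VP8 by moving it to the front of the codec list
const transceiver = pc.addTransceiver('video');
const { codecs } = RTCRtpReceiver.getCapabilities('video');
const preferred = [
  ...codecs.filter((c) => c.mimeType === 'video/VP8'),
  ...codecs.filter((c) => c.mimeType !== 'video/VP8'),
];
// Must be called before the offer/answer exchange
transceiver.setCodecPreferences(preferred);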

In the AI space, you can likewise choose tools optimized for low latency. For example, faster variants of the Whisper ASR model, such as faster-whisper or distil-whisper, can transcribe speech with lower latency than the original model.

In the same vein, using an LLM with fewer parameters or one optimized for speed, like Llama 3.1 8B or Claude 3 Haiku, will give you lower inference times.

Some providers run inference on specialized hardware or heavily optimized serving stacks to offer high token-generation rates with general-purpose models. Examples include Groq, Together AI, and Fireworks AI.

Prioritize Latency Management as a Critical Aspect of Application Development

We have seen here that managing latency is essential for delivering a smooth and satisfying user experience in real-time communication applications. By understanding the factors that contribute to latency, optimizing processes and infrastructure, and selecting the right tools and technologies, developers can significantly reduce latency and improve overall performance.

As we continue to push the boundaries of what is possible with real-time communication, it’s essential to prioritize latency management as a critical aspect of application development. By doing so, we can unlock new possibilities for interactive and immersive experiences that captivate users worldwide.

Ready to overcome latency challenges and deliver exceptional real-time experiences? Let WebRTC.ventures help! Our expert team can optimize your infrastructure, select the right technologies, and implement effective latency management strategies. Contact us today and Let’s Make it Live!
