Media over QUIC is gaining momentum, AI in video is definitely here but not quite ready for real-time, and the WebRTC community continues to be vibrant. Those are just a few of my takeaways from the excellent RTC.ON conference that I attended in Krakow, Poland, last week.

In the video below, I share some sights and sounds from the conference. Keep scrolling after that, as this post will go into more detail about the conference and include links relevant to many of the talks.

A video summary of the RTC.ON conference

Upon arriving in Krakow, my first job was to head straight to Software Mansion, the hosts of RTC.ON, to test out the broadcast studio they had set up. This was in preparation for the special episode of WebRTC Live that we had planned for September 12. In that episode, which you can see below, I interviewed several of the conference speakers to get a preview of their talks. Our own Alfred Gonzalez Trastoy from WebRTC.ventures previewed his talk about building live captioning and translation into a WebRTC video application. Next, Violina Popova from ClipMyHorse.tv previewed her talk about using WebSockets in a React Native application for live streaming equestrian events. Finally, I spoke with Mateusz Front from Software Mansion about his talk on the Membrane Framework, an open source multimedia framework based on Elixir that he maintains.

You can see the full conversation in the next video. After that, I’ll dive more into some of the other speakers at the conference.

WebRTC Live – Live from Krakow, Poland!

Now, let’s get into some of the speakers! Please note that I didn’t get to see all the talks, so my apologies to those I don’t include in my summary below. You can see all the talks in the conference agenda, and I believe videos of all of them will eventually be published on the RTC.ON 2024 YouTube playlist.

Chad Hart – “WebRTC Developer Dynamics: An open source analysis”

Chad Hart from webrtcHacks and RingCentral started off the conference with an overview of the health of WebRTC, based on analysis of public sources like GitHub projects and Stack Overflow questions about WebRTC. Chad noted that the pandemic introduced a lot of people to WebRTC, so of course there was a natural spike. But many of those who have stuck around after the pandemic are producing a lot of code, leaving the community in a really healthy state in 2024. He’ll write up a thorough analysis soon. If you’re not already on his list, I definitely recommend it. (Chad was also my guest on WebRTC Live in July to discuss “Tools for WebRTC Hacks”.)

Based on his analysis of code committed publicly to GitHub, WebCodecs hasn’t taken off yet. I assume that’s because the extra work necessary to modify the media pipeline is only useful in certain niche cases. For most people, the standard WebRTC pipeline remains very effective. Chad noted there is growing interest in Media over QUIC and WebTransport, but nothing is on pace to replace the more standard WebRTC implementations at this point.

Obviously this is something our team at WebRTC.ventures will watch closely. We are happy to work with “unbundled WebRTC” technologies like WebCodecs and WebTransport when it makes sense, though we continue to see more typical WebRTC implementations as the best option for most use cases. 

Lorenzo Miniero – “WebRTC and QUIC: How hard can it be?”

Lorenzo was the first of two speakers to go into more detail about Media over QUIC (also referred to as MoQ or MoQT, and pronounced like “mock”). Lorenzo talked about his experiments with Media over QUIC in his Janus media server, and the benefits it can offer, such as a more flexible and lower-latency way of sending media.

MoQ assumes everything transmits well because it sits on top of the newer QUIC protocol, which should handle things like congestion control and retransmits for you. But the details are not all clear, and Lorenzo noted this is a big area he is still learning about.

Because the standards are not complete, it’s not certain how all of this will work out. Lorenzo does feel that MoQ and WebRTC will coexist. There will be some overlap, but also different attributes such that they don’t have to replace each other.

See Lorenzo’s talk here

Wojciech Jasiński – “On challenges and considerations for real time AI processing”

Wojciech is from the Software Mansion team. He gave an interesting talk about their experiments with AI in real-time video. He shared the results of their experiments with things like removing beer bottles from live video streams. 

This is an area of AI technology with more challenges than solutions right now. I appreciated how transparent Wojciech was about what they were successful at doing and also what challenges remain. 

The most interesting area to me was when he discussed how “video inpainting” is much harder in real-time video. Also known as video completion, this is how AI can remove an object from a video and fill in the missing areas so that the result is “both spatially and temporally coherent.” This is much easier to implement in pre-recorded video because the AI has access to video frames before, during, and after the object being removed, i.e., the beer bottle in the presentation’s example. In real-time video, however, not only are you trying to do this very quickly, probably with edge computing, you also don’t have access to as many replacement frames. Because it’s a live video, there are no future frames!
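To make that constraint concrete, here is a minimal sketch of frame-by-frame, spatial-only inpainting using OpenCV. This is purely illustrative and not the speaker’s actual pipeline; it simply shows what you are limited to when future frames don’t exist yet, and the object detector it assumes is hypothetical.

```python
import cv2
import numpy as np

def inpaint_live_frame(frame: np.ndarray, object_mask: np.ndarray) -> np.ndarray:
    """Fill the masked region using only the current frame.

    Offline video inpainting can also borrow pixels from past and future
    frames; in a live stream only spatial information from this frame is
    available, which is why temporal coherence is so much harder to achieve.
    """
    # object_mask: uint8 image, 255 where the object (e.g. the beer bottle) was detected
    return cv2.inpaint(frame, object_mask, 3, cv2.INPAINT_TELEA)

# Hypothetical usage inside a capture loop:
# ok, frame = capture.read()
# mask = detect_object_mask(frame)       # assumed detector, not shown here
# clean_frame = inpaint_live_frame(frame, mask)
```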

See Wojciech’s talk here

Rob Pickering – “What happens when AI starts grokking streaming audio directly”

Rob was one of a couple of speakers to talk about the latest developments in conversational AI and grokking streaming audio. This topic is very interesting to our team at WebRTC.ventures because we work with conversational AI bots in our contact center work and in our own Conectara implementation.

Rob showed demos of his LLM voice work with ApliSay, which provides conversational AI for telephone conversations. He made many good points about turn-taking in conversations, which happens very naturally between humans based on the visual and audio cues we give each other, as well as our own social etiquette, but is much harder to account for when part of the conversation is an AI bot. The conventional way to do that now is with text-based LLMs, but that introduces complications: you have to transcribe the speakers in real time, send the text to the LLM in chunks, and then appropriately handle the text chunks returned from the LLM and play them back as audio to the user. This not only creates latency in the system, but can easily create conversationally awkward moments.
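As a rough illustration of that conventional pipeline, the flow looks something like the sketch below. The adapter functions are hypothetical placeholders, not ApliSay’s actual code; every hop adds latency, which is where the awkward pauses come from.

```python
async def conversational_turn(audio_chunks, transcribe, complete, synthesize, play):
    """One bot turn in the conventional STT -> text LLM -> TTS pipeline.

    `transcribe`, `complete`, `synthesize`, and `play` are hypothetical async
    adapters around real speech-to-text, LLM, and text-to-speech services.
    """
    transcript = []
    async for chunk in audio_chunks:                     # 1. stream the caller's audio
        text = await transcribe(chunk)                   # 2. speech-to-text, chunk by chunk
        if text:
            transcript.append(text)

    reply_text = await complete(" ".join(transcript))    # 3. text LLM produces a reply
    reply_audio = await synthesize(reply_text)           # 4. text-to-speech
    await play(reply_audio)                              # 5. play the audio back to the caller

# Run with asyncio.run(conversational_turn(...)) once real adapters are wired in.
```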

Rob discussed how OpenAI launched multimodal abilities in 2023. This should mean that you can have the LLM process the human audio directly and return audio directly, which should provide lower latency and higher-quality conversations. This will be a big advance in LLMs and conversational AI. However, Rob noted that while this technological shift is happening faster than he expected, these multimodal capabilities are not generally available as an API yet, and they still require work before being used for conversational AI in the enterprise.

Even once fully multimodal LLMs are publicly available via APIs, they will likely be more expensive than a text-based LLM. So for simpler conversational AI applications, the current method of chunking audio data and sending it to a text-based LLM will remain the favored implementation for cost reasons.

Rob also recommended reading Rich Sutton’s 2019 post “The Bitter Lesson,” whose argument suggests that speech-to-speech models are the way to go. Rob says they are almost here!

See Rob’s talk here

Damien Stolarz – “WebRTC and Spatial Computing on Apple Vision Pro”

Damien, the CEO of Evercast, talked about WebRTC and spatial computing on the Apple Vision Pro. Evercast works with film production workflows for big Hollywood and gaming clients.

Damien and his team of 30 engineers ported libWebRTC to the Vision Pro. They had to rewrite parts of the audio and video handling to better accommodate their film/media use cases. Evercast runs on the Vision Pro so that customers can do immersive video editing in the headset. This is important for seeing the video in better quality and in larger layouts, as it would appear on a movie screen.

WebRTC is used for collaboration with other users in the editing workflow directly from the headset. This was a cool presentation and one of the best examples I’ve seen of the advantages of getting WebRTC working in a headset.

See Damien’s talk here

Dan Jenkins – “Taking ICEPerf.com to the next level”

Our friend Dan Jenkins from NimbleApe talked about his latest project, ICEPerf, which helps test different commercially available TURN networks. Since TURN is so important to getting a solid WebRTC connection in many corporate networks, this is a really great tool to help you compare providers. Dan has done a nice job working directly with the different network providers to provide up-to-date and realistic performance numbers across different geographic locations. He has plans to continue to increase the utility of this service. Definitely go take a look at ICEPerf.com!

See Dan’s talk here

Mateusz Front – “Improving DX and adoption of Membrane Framework”

Mateusz talked about Software Mansion’s Membrane framework for media streaming, which is based on managing media pipelines. Mateusz is the maintainer of Membrane and was also on the WebRTC Live broadcast that I linked to above. It is a really interesting framework to use as the basis for a media streaming project, and they are building a good community around it.

See Mateusz’s talk here

Piotr Skalski – “Everything you wanted to know about VLMs but were afraid to ask”

Piotr Skalski, a Computer Vision Engineer at Roboflow, talked about Vision Language Models (VLMs). He showed how VLMs are computer vision models that can now do complicated analysis that once required multiple models. An example was identifying a car’s make, model, and license plate number with a single model. This previously would have required 3-4 different trained models: one for each type of data you are identifying.

My takeaway was that this works well on static images, but VLMs are not ready yet for real-time video. However, I assume they are not far away. This is definitely a space to watch!

See Piotr’s talk here

Violina Popova – “Real-Time Video Streaming with WebSockets in React Native”

Violina talked about the importance of real-time messaging in video streaming applications. Her presentation focused more on WebSockets and the messaging around the live video that their ClipMyHorse.tv application provides for equestrian events around the world. Because their application is used for gaming around the events, the messaging must be done with low latency to ensure that no one has an advantage based on the timing of the information they receive. They also use WebSockets for other signaling and synchronization around the live video events, such as detecting when the same user account has opened the live stream on multiple devices. Because they are a subscription service, that messaging will also be used to shut down the live stream on the person’s other devices, so that a single paid account cannot be shared between multiple viewers.
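As a rough sketch of how that kind of single-session enforcement can work over WebSockets, here is some illustrative server-side Python using the websockets package (assuming a recent version of the library; the message format and ClipMyHorse.tv’s actual React Native and backend code will of course differ):

```python
import asyncio
import json

import websockets

# account_id -> the single WebSocket connection currently allowed to watch the stream
active_sessions = {}

async def handler(ws):
    # First message identifies the account (this message format is an assumption).
    hello = json.loads(await ws.recv())
    account_id = hello["account_id"]

    # If this paid account already has the stream open on another device,
    # tell the old session to shut down before registering the new one.
    old = active_sessions.get(account_id)
    if old is not None:
        await old.send(json.dumps({"type": "session_replaced"}))
        await old.close()
    active_sessions[account_id] = ws

    try:
        async for _message in ws:
            pass  # handle low-latency event messages around the live stream here
    finally:
        if active_sessions.get(account_id) is ws:
            del active_sessions[account_id]

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```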

Violina is also cofounder of Front End Queens, a group supporting women in software development. Please check them out!

See Violina’s talk here

Boris Grozev – “Jitsi Videobridge: the state of the art SFU that powers Jitsi Meet”

Boris from 8×8 gave a presentation on the Jitsi SFU. One area I found particularly interesting was when he talked about performance management. They take advantage of the WebRTC bandwidth estimator, which uses jitter and packet loss to estimate connection strength. Based on that estimate, they can make optimizations for each user. For testing at scale, Boris mentioned they use clusters of Selenium across about 200 VMs to simulate 10,000 viewers and 50 active participants in a single call. This not only showed the power of Jitsi Meet, but provided inspiration for those testing other WebRTC applications at scale.
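To illustrate the general idea (this is not Jitsi’s actual code, and the bitrates are made-up numbers), an SFU can map each receiver’s bandwidth estimate onto a simulcast layer roughly like this:

```python
# Simulcast layers the sender publishes, highest quality first.
# The bitrates below are illustrative, not Jitsi's configuration.
LAYERS = [
    ("high",   1_500_000),   # ~720p
    ("medium",   500_000),   # ~360p
    ("low",      150_000),   # ~180p
]

def select_layer(estimated_bps: int) -> str:
    """Pick the best simulcast layer that fits the receiver's estimated bandwidth.

    The estimate itself comes from the WebRTC bandwidth estimator, which is
    driven by observed jitter and packet loss on the receiver's connection.
    """
    for name, required_bps in LAYERS:
        if estimated_bps >= required_bps:
            return name
    return "low"  # always send something, even on a very poor connection

print(select_layer(600_000))   # -> "medium"
```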

Boris mentioned using pagination of videos, a common UX technique in large calls where you only show the videos of the participants currently visible on screen. It saves on bandwidth by never sending more video streams than the user can see at once. To further help in large calls where many users may have their microphones unmuted, Jitsi also only sends audio from the top three speakers at any given point. Lower-ranking audio tracks are dropped, as are any tracks that are silent. These practices are important when trying to scale large group calls and stream them out to thousands of other viewers using WebRTC.
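A toy version of that audio ranking might look like the following; Jitsi’s real dominant-speaker detection is considerably more sophisticated, and the threshold here is just a placeholder.

```python
SILENCE_THRESHOLD = 0.01  # illustrative energy floor below which a track counts as silent
MAX_FORWARDED = 3         # only the top three speakers are forwarded

def tracks_to_forward(audio_levels: dict[str, float]) -> list[str]:
    """Given per-participant audio energy levels, pick which audio tracks to forward.

    Silent tracks are dropped entirely, and only the loudest MAX_FORWARDED
    speakers are sent on to the other participants.
    """
    speaking = {p: level for p, level in audio_levels.items() if level > SILENCE_THRESHOLD}
    ranked = sorted(speaking, key=speaking.get, reverse=True)
    return ranked[:MAX_FORWARDED]

# Example: four unmuted participants, one effectively silent.
print(tracks_to_forward({"ana": 0.42, "boris": 0.30, "chen": 0.05, "dana": 0.002}))
# -> ['ana', 'boris', 'chen']
```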

Because users can paginate down to other videos, and because different users may start speaking and need to be broadcast, it’s important that hidden or dropped tracks are maintained by the SFU so they can be sent out quickly when needed. I believe Boris said these sorts of changes are made in less than 100ms, so users don’t notice.

Scaling techniques like this can be helpful in a variety of use cases that need larger group video calls, such as EdTech applications, interactive webinars, online conferences, and corporate presentations.

See Boris’ talk here

Enzo Piacenza – “Building a low-latency voice assistant leveraging Elixir and Membrane: Insights and challenges”

Enzo, a senior software engineer at Telnyx, was another speaker to address conversational AI. Enzo talked about the use of Membrane at Telnyx in a low-latency voice assistant. One interesting tidbit he shared is how they used the Silero voice activity detection (VAD) library to help filter out noise when users talk to the assistant in loud environments.
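For reference, here is a minimal sketch of using Silero VAD via torch.hub to keep only the speech portions of a caller’s audio. This is illustrative only, assuming the torch.hub distribution of silero-vad, and not necessarily how Telnyx wires it into their Membrane pipeline.

```python
import torch

# Load the Silero VAD model and its helper utilities from torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, collect_chunks = utils

# 16 kHz mono audio of the caller, e.g. captured from the media pipeline.
wav = read_audio("caller.wav", sampling_rate=16000)

# Find the segments that actually contain speech, ignoring background noise.
speech_segments = get_speech_timestamps(wav, model, sampling_rate=16000)

# Keep only the speech so downstream transcription isn't fed noise.
speech_only = collect_chunks(speech_segments, wav)
```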

There was also some interesting discussion in the presentation and Q&A about handling interruptions between the human and the conversational AI bot. Many people, like myself, will say things like “umm hmm ok” in conversations as the other person speaks. When you’re talking to a bot, this can confuse it and make it stop. Enzo noted that they have to continually tune the sensitivity of the chatbot to account for this, something he wants their customers to be able to do directly as well.

I had a chance to ask him more about this at the after party – a good reason to attend conferences in person! He said the tuning is all about the window of time over which they consider sound in the human’s audio track before the bot will pause. I believe he said it’s less than 100ms right now, which may be too sensitive, since my “umm ok” would likely interrupt the bot.

If you increase that window, it would require me to say more before the bot stops. This is good unless you increase it too much and then the human will get frustrated, feeling like they have to shout over the bot to get it to stop. You can see how it’s a fine line to balance. Enzo told me that if the bot is interrupted, it stops talking and remembers where it was. That way, if I don’t say more, the bot can just pick up where it left off. Or if I do say more, it at least has the context as to where the conversation ended. These are good practices to keep in mind if you are building your own Conversational AI implementation.
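A toy sketch of that kind of barge-in window is below. It is purely illustrative, not Telnyx’s implementation, and the 300ms default is an arbitrary placeholder for the tunable value Enzo described.

```python
class BargeInDetector:
    """Pause the bot only after the caller has spoken for longer than a tunable window.

    A very small window means a brief "umm ok" interrupts the bot; a very large
    window means the caller has to shout over the bot to get it to stop.
    """

    def __init__(self, window_ms: int = 300):
        self.window_ms = window_ms      # tunable per customer / use case
        self.voiced_ms = 0              # consecutive caller speech observed so far

    def on_audio_frame(self, frame_ms: int, caller_is_speaking: bool) -> bool:
        """Return True when the bot should pause and remember where it was."""
        if caller_is_speaking:
            self.voiced_ms += frame_ms
        else:
            self.voiced_ms = 0          # speech must be continuous to count
        return self.voiced_ms >= self.window_ms

# Feed it 20 ms audio frames: a short back-channel won't trip it, sustained speech will.
detector = BargeInDetector(window_ms=300)
```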

See Enzo’s talk here

Alfred Gonzalez – “Boosting Inclusivity: Closed captioning & translations in WebRTC”

As mentioned in my WebRTC Live interview above and in my conference video, I was happy to see our very own Alfred Gonzalez Trastoy speak about real-time captioning and translation in WebRTC applications. Alfred spoke about general concepts and best practices based on our work with clients. He also gave a live demo of using speech and translation APIs to capture his own voice in both English and Spanish and show translations in real time on screen in Polish, English, and Spanish. This shows the power of these rapidly advancing APIs to implement real-time translation for many use cases, such as contact centers, where agents can work with customers across languages.
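Alfred will cover the real implementation in his own post, but as a simple illustration of the kind of API involved, here is a sketch using Amazon Translate via boto3. That choice is an assumption on my part (our Conectara work is AWS-based), not necessarily what Alfred used in his demo.

```python
import boto3

translate = boto3.client("translate", region_name="us-east-1")

def caption_in_languages(transcript: str, source: str, targets: list[str]) -> dict[str, str]:
    """Translate one live transcript segment into several caption languages."""
    captions = {source: transcript}
    for lang in targets:
        result = translate.translate_text(
            Text=transcript,
            SourceLanguageCode=source,
            TargetLanguageCode=lang,
        )
        captions[lang] = result["TranslatedText"]
    return captions

# e.g. an English transcript segment shown as English, Spanish, and Polish captions
print(caption_in_languages("Welcome to the demo", "en", ["es", "pl"]))
```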

Alfred will be writing a blog post for our site soon with more details, so I’ll leave those to him. In the meantime, if you’re interested in this topic for contact centers specifically, check out Conectara, our cloud contact center solution based on Amazon Connect, the Amazon Chime SDK, and the AWS cloud. We can build it for you!

See Alfred’s talk here

A word on Elixir

The conference hosts from Software Mansion have a strong interest in using Elixir in their work. If that’s your tech stack of choice, make sure you check out all the talks from the conference, not just the ones I’ve gone into detail on here. Michał Śledź spoke about their “batteries included” implementation of Elixir WebRTC, which was very interesting. Wojciech Barczyński spoke about their LiveCompositor tool, which uses its own media server implementation for streaming applications and is built primarily in Rust, but also has a Membrane SDK/plugin. Several other conference speakers, such as Enzo mentioned above, also used Elixir in their implementations.

Of particular note, even if you don’t use Elixir, is that Michał spoke about how they have gone to extra effort in the Elixir WebRTC documentation to include general WebRTC application development and debugging tips. He’s correct to note that many of the “how to” blog posts on WebRTC are a bit dated, and there are not many sites that provide a comprehensive list of WebRTC-related FAQs. So he and his team have been documenting what they learn along the way of building the Elixir WebRTC project. That’s definitely a valuable contribution to the community, and along with other efforts like Sean DuBois’ “WebRTC for the Curious”, it should provide helpful information.

Of course, even though our team at WebRTC.ventures doesn’t maintain a comprehensive “how to” on our site, our blog archives do contain implementation tips, demos, and posts on many WebRTC-related implementations in web and mobile. So be sure to search our blog for anything that interests you, or contact us if our team can help! But let’s close with a couple of other thoughts on the future of WebRTC and Media over QUIC before ending this post…

Ali C. Begen – “DASH and Media-over-QUIC Transport Face-Off: Performance Showdown at Low Latency”

Is Media over QUIC Transport the future of live video applications? The final talk I want to bring to your attention was a very entertaining and insightful presentation by Professor Ali C. Begen of Ozyegin University in Turkey. Ali gave an overview of the past, present, and future of streaming technologies, on a scope wider than just WebRTC. You can find his slides here.

Ali has an extensive background as a contributor to DASH, and he won a Tech Emmy for that standard’s contributions to Hollywood. (This was the day I learned Tech Emmys are a thing!) As Ali reminded us, LL-HLS (Low Latency HLS) and LL-DASH (Low Latency DASH) work by chunking up samples into smaller pieces so playback can begin sooner.

Media over QUIC (MoQ), or Media over QUIC Transport (MoQT) as Ali prefers to call it, goes further. MoQT offers low latency for both ingest and distribution, and its latency is tunable (tuning for a longer latency allows for retransmits and higher quality).

Unlike WebRTC, where congestion control is handled in the application-level media engine, QUIC provides congestion control at the transport layer, so MoQT gets it automatically by being transported on top of QUIC. In that way, MoQT could possibly replace both WebRTC and HLS by providing high scalability and interactivity, while still maintaining low latency.

This capability is one of the key reasons Ali likes MoQT more than WebRTC. In good humor, he joked that WebRTC is like drinking from a firehose, while MoQT is drinking in a more controlled fashion. The outstanding questions are just how low-latency the final standard will be, and whether it will take 10 years to be standardized, like WebRTC did.

Ali is hopeful that the standardization process will be short and that the ultra low latency parts will remain in the standard. Only time will tell to what degree. Even once MoQT is here, WebRTC will still have valuable use cases to serve. It’s likely that both will co-exist for a long time.

See Ali’s talk here


All conferences are local!

We are big proponents of remote work at WebRTC.ventures; it’s not only what we do but also how we’ve always worked. But even video communications experts like ourselves still enjoy getting out to conferences and traveling. One of the benefits of working remotely with people all over the world is that it enables you to lay the groundwork for in-person travel where you can experience the things that make your colleagues’ local communities special.

A popular late-night street food in Krakow.

How else would you experience these two guys who sell amazing sausage out of a blue van late at night? That’s not the opening line to a horror movie; it’s one of the most popular late-night street foods in Krakow (in addition to zapiekanki, of course). I certainly would not have experienced those things if I had not gone to the RTC.ON conference in person and asked our Polish hosts what local food and drink I should try. Those are experiences that, perhaps ironically, I would never have had without remote work and video communications.

Contact us!

If you’re interested in real time communications, you might want to put the RTC.ON conference on your list for next year. 

If you’re interested in building real time video applications, don’t wait for next year! Contact the WebRTC.ventures team of experts today.
