WebRTC Live 104: Why Vision Language Models (VLMs) Deserve a Closer Look

Large Language Models (LLMs) have dominated conversations about AI integration in WebRTC, particularly when it comes to voice-based features like transcription, summarization, and intent detection. But there’s an emerging layer that many outside of research circles are missing: Vision Language Models (VLMs). Unlike LLMs, which work with text and speech, VLMs are capable of understanding and generating language based on visual inputs—opening up new possibilities for analyzing what’s happening on camera during a WebRTC session.

In this episode of WebRTC Live, we speak with Yahia Salman, an AI researcher at the George Mason University Natural Language Processing Lab, where he is also a Computer Science and Physics student. Yahia won an award at Stanford University’s TreeHacks 2025 for building ZoneOut, an application that integrates Zoom Real Time Media Streams (RTMS) with both LLMs and VLMs, plus Retrieval Augmented Generation (RAG), to extract meaningful insights from both video and audio in virtual classrooms. He recently presented this work at the Zoom Developer Summit.

Join us as Yahia shares the architecture behind ZoneOut, what it’s like to work with VLMs, and why he thinks they deserve more attention in the development of real-time applications.

Bonus Content

  • Our regular monthly industry chat with Tsahi Levent-Levi. This month’s topic: Do We Really Need End-to-End Encryption (E2EE) in WebRTC? You can also watch this content on our YouTube channel.

Key insights and episode highlights below the video.

Watch Episode 104!

Key Episode Insights

What are VLMs and why are they essential for the future of AI? Language alone can’t capture the whole picture, especially in applications like virtual meetings, online learning, or real-time analysis where visual context matters just as much as words. VLMs bring AI closer to human-level understanding by allowing it to process images alongside text. Yahia explains, “I’m sure most people are familiar with LLMs, and probably a lot of you are familiar with VLMs, but it’s essentially the same thing as an LLM, except that it has vision capabilities. So you can give it an image and it’s able to understand and extract some data and do some processing on it or whatever your use case is.”
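For readers who want to see what “give it an image” looks like in practice, here is a minimal sketch of sending one captured frame to a vision-capable model. It assumes the OpenAI Python SDK and a vision-capable model such as gpt-4o; the file name and prompt are illustrative, not something from the episode.

```python
# Minimal sketch: send one captured frame to a vision-capable model and ask
# about it. Assumes the OpenAI Python SDK and a vision-capable model such as
# gpt-4o; the file name and prompt are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("whiteboard_frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe what is written on the board in this frame."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```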

VLMs vs. LLMs: What’s the real difference? At a glance, they share a nearly identical architecture. But as Yahia explains, the key difference lies in how they’re trained: “What makes a VLM different from an LLM is in the pre-training process. Along with training it on the common crawl or whatever massive data set, they also train on a massive data set of images that they follow through this exact same way that I just mentioned. So in terms of architecture, it’s super similar. There’s just some additional step that’s put on top of it, but somehow we found that just that little bit makes these exact same models be able to understand all this context.”
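As a rough illustration of that “additional step,” here is a toy PyTorch sketch: a stand-in vision encoder whose output is projected into the same embedding space the language model already uses for text tokens, so image and text can flow through the same transformer together. The dimensions and modules are illustrative and do not correspond to any particular model.

```python
# Toy sketch of bolting vision onto a language model: encode the image,
# project it into the LLM's token embedding space, and concatenate it with
# the text tokens. All sizes and modules here are illustrative.
import torch
import torch.nn as nn

d_model = 512       # LLM hidden size (illustrative)
vision_dim = 768    # vision encoder output size (illustrative)

text_embed = nn.Embedding(32_000, d_model)      # ordinary token embeddings
vision_encoder = nn.Sequential(                 # stand-in for a ViT/CLIP-style encoder
    nn.Flatten(), nn.Linear(3 * 224 * 224, vision_dim), nn.GELU()
)
projector = nn.Linear(vision_dim, d_model)      # maps image features into token space

tokens = torch.randint(0, 32_000, (1, 16))      # fake text prompt
image = torch.randn(1, 3, 224, 224)             # fake video frame

text_tokens = text_embed(tokens)                              # (1, 16, d_model)
image_tokens = projector(vision_encoder(image)).unsqueeze(1)  # (1, 1, d_model)

# The decoder then attends over image and text tokens as one sequence.
sequence = torch.cat([image_tokens, text_tokens], dim=1)      # (1, 17, d_model)
print(sequence.shape)
```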

The importance of building with flexibility. In a rapidly evolving world of AI, models improve constantly and dramatically. Yahia stresses the importance of building applications with flexibility and future-proofing in mind. He says, “One of the biggest things to keep in mind in the sphere that we’re in, where things change almost every day, is to build applications with the idea that the models will get better. And so a lot of start-ups that will be built right now or applications that we built get instantly destroyed and have to disband once OpenAI updates their model, or something like that, because all of a sudden OpenAI supports their full functionality or whatever. So build it with the idea that your app will get better as the model gets better.”
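One way to act on that advice is to keep the model behind a thin interface and a single configuration value, so upgrading to a better model is a one-line change rather than a rewrite. The sketch below is a minimal illustration; the class, method, and environment-variable names are assumptions, not anything from the episode.

```python
# Minimal sketch of "build it with the idea that your app will get better as
# the model gets better": the app depends on a small protocol, and the concrete
# model is chosen in one place from configuration. Names are illustrative.
import os
from typing import Protocol

class VisionModel(Protocol):
    def describe(self, frame_bytes: bytes, prompt: str) -> str: ...

class OpenAIVisionModel:
    def __init__(self, model: str) -> None:
        self.model = model  # e.g. "gpt-4o" today, a newer model tomorrow

    def describe(self, frame_bytes: bytes, prompt: str) -> str:
        # Call the provider's API here (as in the earlier frame sketch);
        # a placeholder return keeps this sketch self-contained.
        return f"[{self.model}] description of a {len(frame_bytes)}-byte frame"

def build_model() -> VisionModel:
    # Swapping providers or model versions happens only here.
    return OpenAIVisionModel(model=os.getenv("VLM_MODEL", "gpt-4o"))
```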

Episode Highlights

How the idea for ZoneOut was born

Like many innovative projects, ZoneOut was sparked by a real-world pain point that Yahia and his colleagues experienced firsthand. Yahia shares the story:

“We were like, ‘Okay, how do we fix this problem?’ So first we were thinking, ‘What if we just take the teacher, transcribe what they’re saying, save it in some way and be able to retrieve it?’ But all of us, I’m physics and CS; they’re both computer science. And a lot of stuff that’s taught in these classes isn’t necessarily said out loud. You have long, complicated equations, like in physics, you have Schrödinger’s equation, they’re very long. And so the teacher is not going there reading the entire equation, she’s just kind of writing on the board and pointing, so we need some way to capture this vision stuff. And so that’s where the project of ZoneOut was born, the idea of ZoneOut was born. And the whole idea is that it’s like a buddy watching with you, so that whenever you miss something or you forget, you can ask it and it will just tell you what’s going on.”
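To make the “buddy watching with you” idea concrete, here is a hedged sketch of that kind of pipeline: transcript lines and VLM descriptions of board frames go into one searchable index, and “what did I just miss?” questions are answered by retrieving the most relevant snippets (the RAG part). This is illustrative only, not the actual ZoneOut code; the embedding, VLM, and LLM calls are stand-in functions.

```python
# Hedged sketch of a lecture "buddy": index both transcript lines and VLM
# descriptions of board frames, then answer questions by retrieval.
# embed_text(), describe_frame(), and answer_with_llm() are stand-ins for
# real embedding, VLM, and LLM calls.
import math
from dataclasses import dataclass

def embed_text(text: str) -> list[float]:
    # Stand-in embedding; use a real embedding model in practice.
    chars = [float(ord(c)) for c in text[:32]]
    return chars + [0.0] * (32 - len(chars))

def describe_frame(frame: bytes) -> str:
    # Stand-in for a VLM call (see the earlier frame-description sketch).
    return "Board shows a long equation the teacher pointed at without reading it aloud."

def answer_with_llm(question: str, context: str) -> str:
    # Stand-in for an LLM call that answers grounded in the retrieved context.
    return f"From the lecture so far: {context}"

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

@dataclass
class Snippet:
    timestamp: float
    text: str                 # transcript line or frame description
    embedding: list[float]

index: list[Snippet] = []

def ingest(timestamp: float, transcript_line: str | None = None, frame: bytes | None = None) -> None:
    """Store whatever arrived from the meeting stream: a transcript line, a frame, or both."""
    if transcript_line:
        index.append(Snippet(timestamp, transcript_line, embed_text(transcript_line)))
    if frame:
        description = describe_frame(frame)
        index.append(Snippet(timestamp, description, embed_text(description)))

def ask(question: str, k: int = 5) -> str:
    """Retrieve the k most relevant snippets and answer over them."""
    q = embed_text(question)
    top = sorted(index, key=lambda s: cosine(s.embedding, q), reverse=True)[:k]
    return answer_with_llm(question, "\n".join(s.text for s in top))
```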

VLMs offer the potential to make AI more intuitive and human

As AI keeps evolving and the world around it changes, the ability to truly understand real-world contexts becomes even more important. That’s where VLMs jump in. They help AI see the world as we do.

Yahia explains, “I think it boils down to simply: we don’t live in a text-based world. The way we communicate, we don’t just write to each other and then read. We observe so much. And so if you think of how you learn almost anything, it’s by watching people do things. Very rarely do you have somebody who will just read the textbook, ignore all the graphs, all the charts, all the anything, and still understand everything fully. A lot of things, whether you’re learning how to work out or you’re learning how to golf or you’re learning how to do whatever, you’re kind of watching somebody do it and so having only descriptions of stuff seems kind of a naive, not naive, but just a very small first step into what the future or AI really will hold.”

Emotion recognition remains a big challenge for AI models

Emotion recognition is a challenging task for AI models because humans often don’t show what they feel or think. Our facial expressions don’t always reflect our internal state, which makes it especially difficult for AI models to accurately detect emotions in real-world settings. Yahia explains:

“The research I was talking about earlier, the paper I worked on, was in emotion recognition and things like that. One insight, one of the biggest insights I got regarding emotions, is that a lot of times, the emotions are so subtle you feel them way stronger in your head than you feel them on your face. If you think back to a time you were really mad, you won’t sit there like that for five minutes or something like that. It won’t really happen. You’ll probably just sit there with a blank expression. So if you gave it somebody that’s like this, I’m so sure I’d be able to tell they were angry, or somebody smiling, that they’re happy. But in the real world, in terms of applications, emotions are a very difficult thing, because we don’t show it a lot of times and we don’t show what we’re thinking a lot of times.”


WebRTC Live will be taking August off!

Our next episode will be in September, live from the RTC.ON conference in Krakow, Poland.

Register for WebRTC Live 105
