WebRTC Live 104: Why Vision Language Models (VLMs) Deserve a Closer Look

Large Language Models (LLMs) have dominated conversations about AI integration in WebRTC, particularly when it comes to voice-based features like transcription, summarization, and intent detection. But there’s an emerging layer that many outside of research circles are missing: Vision Language Models (VLMs). Unlike LLMs, which work with text and speech, VLMs are capable of understanding and generating language based on visual inputs—opening up new possibilities for analyzing what’s happening on camera during a WebRTC session.
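
To make that idea concrete, here is a minimal browser-side sketch of the pattern: sample a single frame from a local camera track and send it, along with a text prompt, to a vision language model for description. The /api/vlm/describe endpoint, its request and response shapes, and the describeCameraFrame helper are hypothetical placeholders for illustration only, not part of any product or API discussed in the episode.

```typescript
// Minimal sketch (browser TypeScript): capture one frame from a WebRTC
// camera track and send it to a VLM backend for a natural-language description.
// The endpoint URL and response shape below are assumed placeholders.

async function describeCameraFrame(video: HTMLVideoElement): Promise<string> {
  // Draw the current video frame onto an offscreen canvas.
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d");
  if (!ctx) throw new Error("2D canvas context unavailable");
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);

  // Encode the frame as a base64 JPEG data URL for transport.
  const frame = canvas.toDataURL("image/jpeg", 0.8);

  // Post the frame plus a text prompt to a hypothetical VLM endpoint.
  const response = await fetch("/api/vlm/describe", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      image: frame,
      prompt: "Describe what is visible in this video frame.",
    }),
  });
  const { description } = await response.json();
  return description;
}

// Usage: attach the camera stream to a <video> element, then sample a frame.
async function run(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const video = document.createElement("video");
  video.srcObject = stream;
  await video.play();
  console.log(await describeCameraFrame(video));
}
```

In a real application the same sampling step would typically run on a server-side media pipeline rather than in the browser, but the shape of the interaction stays the same: frames in, language out.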

In this episode of WebRTC Live, we speak with Yahia Salman, an AI researcher at the George Mason University Natural Language Processing Lab, where he is also a Computer Science and Physics student. Yahia won an award at Stanford University’s TreeHacks 2025 for building ZoneOut, an application that integrates Zoom Real Time Media Streams (RTMS) with both LLMs and VLMs, plus Retrieval-Augmented Generation (RAG), to extract meaningful insights from both video and audio in virtual classrooms. He recently presented this work at the Zoom Developer Summit.

Join us as Yahia shares the architecture behind ZoneOut, what it’s like to work with VLMs, and why he thinks they deserve more attention in the development of real-time applications.

Bonus Content

  • Our regular monthly industry chat with Tsahi Levent-Levi. This month’s topic: Do We Really Need End-to-End Encryption (E2EE) in WebRTC? You can also watch this content on our YouTube channel.

Key insights and episode highlights to follow.

Watch Episode 104!


WebRTC Live will be taking August off!

Our next episode will be in September, live from the RTC.ON conference in Krakow, Poland.

Register for WebRTC Live 105
