Your users report poor call quality, a dropped call, or a connection that never got established. But what actually happened?
In this episode of WebRTC Live, we’ll break down what commonly fails in production WebRTC apps, how experienced teams debug live incidents, and how to build the visibility that keeps you ahead of problems.
Guest host Alberto Gonzalez, CTO of WebRTC.ventures, sits down with Justin Williams, Senior WebRTC Engineer at WebRTC.ventures. Justin brings hands-on experience building real-time media systems with Vonage, Amazon IVS, Amazon Chime, and more.
They’ll cover the tools experienced WebRTC teams rely on, including webrtc-internals, Playwright, and Peermetrics, and what it takes to move from reactive debugging to proactive monitoring.
The episode also features Arin Sime and Tsahi Levent-Levi’s Monthly WebRTC Industry Chat. This month, they discussed two topics: WebRTC transport in Safari, and whether AV1 is truly royalty-free. Watch on YouTube.
Watch Episode 112: How Experienced Teams Debug and Monitor WebRTC in Production
Episode highlights and key insights to follow.
Key Insights
⚡ Failure types in production are complex. The hardest issues often look the same on the surface but come from completely different layers, making fast diagnosis extremely difficult. As Justin explains, “The biggest and most painful category for WebRTC apps is, of course, around network and media path failures. There can be a lot of reasons why users might hit these types of problems. That makes the space a little intimidating when tackling things like ‘the call never starts’ or ‘one user can see and hear everybody else, but nobody else can see them.’ So, is that a network topology issue, or is it track negotiation, or maybe a codec mismatch, or is the user’s camera just unplugged? Those tend to be fairly hard to decipher and not always application-level bugs.”
⚡ Debug systematically to uncover the true failure. User complaints like “bad call quality” or “it didn’t connect” don’t give you enough to act on. You have to translate them into concrete failure signals before you can diagnose anything meaningful.
As Justin explains: “When a customer says something like ‘call quality is bad’ or ‘the call didn’t connect,’ there are a few different places to start, and that’s definitely something that happens all the time. If something is not working, the first question is: how bad was it? Was it choppy? Was it frozen? Was the audio out of sync with the video, or did it never connect at all? Each of these points could be a different part of the stack, and the worst thing you can do from there is just guess off of what the users are telling you. So ideally your logs are comprehensive enough to answer these types of questions. From there, we want the logs to help us look at three things, ideally in this order: connection state, then media flow, then adaptation.”
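The three-step triage order Justin describes (connection state, then media flow, then adaptation) can be sketched as a simple check over a log sample. The log shape and field names below are illustrative assumptions, loosely modeled on `RTCPeerConnection.iceConnectionState` and `getStats()` inbound-rtp reports, not any specific SDK's API:

```typescript
// Hypothetical log sample for one call; field names are assumptions
// for illustration (roughly mirroring WebRTC stats, not a real schema).
interface CallLog {
  iceState: string;                 // e.g. RTCPeerConnection.iceConnectionState
  packetsReceived: number;          // e.g. from an inbound-rtp getStats() report
  framesDecoded: number;
  qualityLimitationReason?: string; // e.g. "cpu" | "bandwidth" | "none"
}

// Return the first layer at which this log sample looks broken,
// checking in the order: connection state -> media flow -> adaptation.
function triage(log: CallLog): string {
  // 1. Connection state: did the transport ever come up?
  if (log.iceState !== "connected" && log.iceState !== "completed") {
    return "connection: ICE never reached connected (check TURN / network topology)";
  }
  // 2. Media flow: is anything actually arriving and decoding?
  if (log.packetsReceived === 0) {
    return "media flow: no packets received (check track negotiation)";
  }
  if (log.framesDecoded === 0) {
    return "media flow: packets arrive but nothing decodes (check codec match)";
  }
  // 3. Adaptation: connected and flowing, but quality is being limited.
  if (log.qualityLimitationReason && log.qualityLimitationReason !== "none") {
    return `adaptation: quality limited by ${log.qualityLimitationReason}`;
  }
  return "ok: no obvious failure in this log sample";
}
```

The point of the fixed ordering is that each later check is meaningless until the earlier one passes: there is no media flow to inspect if ICE never connected, and no adaptation to tune if no frames decode.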
⚡ You’re not optimizing for perfect metrics, you’re optimizing for user priorities. In some apps, uninterrupted audio matters more than video. In others, visual clarity is everything. The “right” trade-off depends on the experience you’re designing. Justin explains, “It’s really a challenging thing, and it takes a while to figure this out. For different use cases and different projects, it might mean different things. Often, bad metrics are actually hidden from the user and aren’t impactful. It’s situational, too: what’s the context of the application? If you’re thinking about something like Google Hangouts or Zoom, you care more about the fact that you can hear other users.”
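One way to make “optimize for user priorities” concrete is to weight raw quality metrics differently per use case before alerting on them. The weights, metric names, and use-case labels below are made-up assumptions for illustration, not values from the episode:

```typescript
// Hypothetical per-call quality metrics, each normalized to 0..1.
interface MediaMetrics {
  audioConcealedRatio: number; // fraction of audio samples concealed
  videoFreezeRatio: number;    // fraction of time the video was frozen
}

// Illustrative weights: a meeting app cares more about audio continuity,
// a screen-sharing review app more about visual clarity. These numbers
// are assumptions; real weights would come from your own user research.
const WEIGHTS: Record<string, { audio: number; video: number }> = {
  meeting: { audio: 0.8, video: 0.2 },
  screenShare: { audio: 0.3, video: 0.7 },
};

// Returns a 0..1 "badness" score; what counts as alert-worthy
// then depends on the use case, not on any single raw metric.
function badness(useCase: string, m: MediaMetrics): number {
  const w = WEIGHTS[useCase] ?? { audio: 0.5, video: 0.5 };
  return w.audio * m.audioConcealedRatio + w.video * m.videoFreezeRatio;
}
```

With this kind of weighting, the same degraded audio stream scores much worse in a meeting context than in a screen-share context, which matches the point that the “right” trade-off depends on the experience you’re designing.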
Episode Highlights
Real-time applications are unpredictable
Unlike traditional systems, real-time applications are asynchronous, stateful, and constantly changing across multiple users and connections, making them significantly harder to reason about, reproduce issues in, and debug in practice. As Justin explains:
“I think probably the big one is that almost everything important in these apps is asynchronous, time-dependent, and stateful across multiple pairs or clients. There are a lot of moving parts, and applications can end up in a lot of different states. So the problem of ‘it works on my machine’ is all too common. Reproducing an issue that QA found tends to be far less trivial than on more static applications.”
Real-time WebRTC systems require a strong testing foundation
Because real-time applications are tightly interconnected and sensitive to change, even small updates can create unexpected issues across the system, making testing a core part of design, not just QA.
Justin explains, “Having a really solid testing foundation is very important for real-time apps because they’re so complex, and a change in one place can unexpectedly affect somewhere seemingly unrelated in a totally different area. With how many different states these apps can end up in, it’s hard to be exhaustive; bugs always pop up in weird edge cases. One thing that really affects your testing approach is the architecture underneath the app: specifically, whether you’re running your own open-source or custom media servers, or relying on a third-party CPaaS. If you’re maintaining your own servers, there’s a bit more responsibility on you. If you’re on a CPaaS, it’s important to stay on top of testing because things might change without you having much control over them. Overall, you definitely want a lot of testing, but the strategy might change slightly based on that.”
Connection lifecycle tracking is key to detecting real-time issues
In real-time systems, the most effective way to catch problems early is to monitor event counts across every stage of the connection flow, so you can see exactly where users succeed or drop off before they start complaining. Justin says,
“Where should teams monitor in production to catch issues before users start to complain? If I had to start with one thing, event counts at every stage of the connection lifecycle really go a long way. If you have a high scale of users, you want to be tracking these things: how many users clicked join; from there, how many reached an ICE connection state of connected; how many published a track; how many received the remote track; how many made it to five minutes without dropping. If you can set all of these up as a funnel of what your users are going through, you get really clear insight into where users are starting to have issues and where they’re dropping off.”
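The lifecycle funnel Justin describes can be sketched as counting distinct sessions per stage, so the drop-off between adjacent stages is visible at a glance. The stage names and event shape below are illustrative assumptions, not a real telemetry schema:

```typescript
// Lifecycle stages in funnel order (names are assumptions for illustration).
const STAGES = [
  "join_clicked",
  "ice_connected",
  "track_published",
  "remote_track_received",
  "alive_5_min",
] as const;

type Stage = (typeof STAGES)[number];

// One telemetry event: a session reached a given stage.
interface SessionEvent {
  sessionId: string;
  stage: Stage;
}

// For each stage, count the distinct sessions that reached it.
function funnel(events: SessionEvent[]): Record<Stage, number> {
  const reached = new Map<Stage, Set<string>>();
  for (const stage of STAGES) reached.set(stage, new Set());
  for (const e of events) reached.get(e.stage)!.add(e.sessionId);
  const counts = {} as Record<Stage, number>;
  for (const stage of STAGES) counts[stage] = reached.get(stage)!.size;
  return counts;
}
```

Reading the counts stage by stage localizes the problem: a big drop between `join_clicked` and `ice_connected` points at networking or TURN, while a drop between `ice_connected` and `track_published` points at track negotiation rather than the network.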
Up Next! WebRTC Live #113
WhatsApp Business Calling and SIP
Wednesday, May 13, 2026 at 11:30 am Eastern
