WebRTC.ventures was founded as a custom design and development firm specializing in building live video applications. In the past, we usually handed the final code over to our clients to manage in production. In practice, however, we often stayed on to help their teams maintain and manage the application. To recognize that need, we have been formalizing our Managed Services offerings in 2023.
Many of our clients asked us to do this work because no one knows the application better than we do. In this blog post and subsequent ones, I’ll begin to share some of what we have learned about maintenance and support of complex communication applications. If you don’t have a Managed Service Provider (MSP) for your application, check out our MSP Program Manager Rafael Amberth’s post, Why Your WebRTC App Needs an MSP.
Even before we took on the role of an MSP, we designed our applications to make their management as easy as possible. You could say that we have even more reason to do that now! This post introduces a series on application design factors that make WebRTC and other kinds of communication applications easier to support.
Communication Applications Are Inherently More Complex
As a prelude to that conversation, I want to discuss preparing for, and investing in, complexity.
“… when things go wrong, if you’re blaming the human, that is the ultimate human error.”
– Author and security expert Kelly Shortridge, Scaling Tech Podcast Episode 23
Please pardon the shameless plug for my podcast about managing engineering teams, but I mention it because I really liked this quote from my interview with the author of the book Security Chaos Engineering. Blaming the human is a classic mistake and the easiest thing to do. But really, in the context of software development, when things go wrong it is usually a sign of something we didn’t design for, or of something wrong in our system or process. If we design properly in advance and prepare for problems, we’re always a step ahead.
This is harder for communication apps than your average web application. From networking to call quality to user error, the act of connecting users by video (or audio, or chat) is inherently harder than simply pulling up a browser and reading a website.
Working with WebRTC, in particular, is in some ways a bit easier than it was a few years ago. In other ways, it has become more complex. WebRTC is now used in a wider variety of use cases, with different needs and concerns, and at a larger scale than ever before, so it helps to have enough experience to think ahead and manage that complexity.
With that said, let’s move on to the first topic.
Designing for Observability
Designing for observability means building logging, monitoring, and alerts into both the application and the DevOps around it. We configure alerts in advance so they reach our team (or yours) through Slack and email, build in good logging and exception handling, and capture as much context around potential errors as possible so we can troubleshoot them when they occur.
Ideally, we’ve got enough logging and monitoring built in that we recognize errors before the customer contacts us. But of course, we also need a process in place for receiving those errors.
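To make "capturing context" a little more concrete, here is a minimal sketch in Python using only the standard `logging` module. It is an illustration under assumptions, not our production code: the `JsonFormatter` class, the `build_logger` and `join_room` helpers, the `context` attribute, and the `room_id` and `user_id` fields are all hypothetical names chosen for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # "context" is a hypothetical attribute supplied via `extra=`.
            "context": getattr(record, "context", {}),
        }
        if record.exc_info:
            # Keep the traceback next to the context that produced it.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "call-service") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def join_room(logger: logging.Logger, room_id: str, user_id: str, connect) -> bool:
    """Call `connect` (a hypothetical callable) and log the outcome with context."""
    context = {"room_id": room_id, "user_id": user_id}
    try:
        connect(room_id, user_id)
        logger.info("joined room", extra={"context": context})
        return True
    except Exception:
        # logger.exception records the active traceback automatically.
        logger.exception("failed to join room", extra={"context": context})
        return False
```

Because every line is a JSON object carrying the same context fields, a log store can filter on a single room or user when a problem is reported, rather than searching free-form text.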
Monitoring Tools
Here are some tools we use for general monitoring. Most of our work is on the AWS stack, so much of our tooling is built around it.
- Alerts – Amazon EventBridge events + Amazon CloudWatch alarms with integration to Slack and Email notifications
- Logging – Amazon CloudWatch logs and Amazon S3 for logs storage
- User and Error Monitoring – Bugsnag exception tracking and troubleshooting context
- Infrastructure Monitoring – Uptime for multi-location pings, monitoring, screenshots matched with error logs, and on-call alerting
- Support Tickets – Zendesk
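As a rough illustration of the alerting item in the list above, the sketch below turns a CloudWatch alarm notification into a Slack message payload. It assumes the alarm is delivered through an Amazon SNS topic to a handler such as a Lambda function, which is one common wiring and not necessarily the one used on any given project. The function names and the emoji mapping are hypothetical. Posting the payload to a Slack incoming webhook is left out so the example stays free of network calls.

```python
import json
from typing import Dict, List

# Emoji per alarm state; this mapping is an illustrative choice.
STATE_ICONS = {
    "ALARM": ":red_circle:",
    "OK": ":large_green_circle:",
    "INSUFFICIENT_DATA": ":white_circle:",
}


def alarm_to_slack_payload(sns_message: str) -> Dict[str, str]:
    """Convert the JSON body of a CloudWatch alarm notification to a Slack payload."""
    alarm = json.loads(sns_message)
    state = alarm.get("NewStateValue", "UNKNOWN")
    icon = STATE_ICONS.get(state, ":grey_question:")
    text = (
        f"{icon} *{alarm.get('AlarmName', 'unnamed alarm')}* is {state}\n"
        f"Region: {alarm.get('Region', 'unknown')}\n"
        f"Reason: {alarm.get('NewStateReason', 'no reason given')}"
    )
    # Slack incoming webhooks accept a JSON body with a "text" field.
    return {"text": text}


def payloads_from_sns_event(event: dict) -> List[Dict[str, str]]:
    """Build one Slack payload per SNS record in a Lambda-style event."""
    return [
        alarm_to_slack_payload(record["Sns"]["Message"])
        for record in event.get("Records", [])
    ]
```

Keeping the formatting in a pure function makes it easy to unit test the alert text without touching AWS or Slack.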
Invest in Success
When teams don’t design for observability in advance, they have to stumble around in the dark when errors occur. That leads to delays in resolving the error, more downtime, unhappy users, and unhappy engineers. So while designing and building in observability adds some upfront time and ongoing cost to the application, it’s well worth it for the reduced stress and shorter downtime when problems occur.
Designing for observability is what allows us to quickly move beyond blaming the humans, and instead focus on identifying and resolving root causes. The data we gather from observability will allow us to have more confidence that we found the specific situations in which the problem occurs, and put better improvements and controls in place in the application and our processes to prevent the error from happening again.
In future posts, I will talk about designing for resilience, testing, security, and change. Stay tuned!
In the meantime, I invite you to learn more about our Application Deployment and Management Services and let us know if we can help you monitor and manage your application.