If there’s a single theme on the WebRTC.ventures blog, it’s that WebRTC is about much more than just video. While most applications we develop include a video component, the audio and data channels of WebRTC can be used independently of the video channels.
In this post, I explore a few of the WebRTC audio use cases that have come up across our content over the last six months. These conversations with industry experts in WebRTC, audio, and AI illustrate the tremendous possibilities for using the voice channels of WebRTC, and they complement the work that our team does at WebRTC.ventures.
1. Voice as an Interface
Voice by itself is an interface, and for most of us, the most natural one of all. Whereas there are classes you can take to become more efficient at using keyboards (“Typing” was probably the most useful class I took in high school), no classes are needed to become efficient at using Voice-based interfaces like Alexa and Siri.
Once you learn the “wake” word and maybe see someone use the device a couple of times, you know all you need to know to start using that device yourself.
I remember a keynote speaker at the highly recommended Voice & AI Conference a couple of years ago saying, "in ten years we won't even use keyboards anymore." I always chuckle at that memory because the speaker made the prediction while scooting around the stage on a Onewheel. That made it hard to take the claim seriously, but it did underscore his visionary approach to the future. And he was making a good point, if somewhat dramatically.
While I doubt that keyboards will be entirely gone in a decade, I am positive that Voice-based interfaces will continue to grow in popularity and usefulness. In a recent episode of my Scaling Tech Podcast, I interviewed Tobias Dengel, the CEO of WillowTree. We discussed the growth of voice interfaces, and one key takeaway from his book, "The Sound of the Future," was the significance of multimodal interfaces in certain use cases.
Tobias predicts that most web and mobile apps will eventually include a microphone icon so users can speak instead of typing. He points out that it's much easier to speak a long search query than to type it into your browser. Supporting Voice input improves search results because you are more likely to supply a long-tail query with more detailed terms.
On the other hand, the answer to that search or query is best delivered as on-screen text, which is faster to process than listening to a synthetic voice read it back. That combination of spoken input and visual output is the multimodal pattern Tobias describes.
As these sorts of interfaces grow in popularity, they will be primarily powered by WebRTC and its audio channel. For developers, this means learning how to use WebRTC to capture that audio and process it into useful commands or content via Speech-to-Text APIs and/or Generative AI APIs. In the simplest case, that means Voice-based input to a text box or search box. In more complex scenarios, developers will need to interpret or map the Voice command to actions that can be taken in the application or website.
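To make the simple case concrete, here is a minimal sketch of that capture-and-transcribe flow in the browser. The capture side uses only standard browser APIs; the /api/transcribe endpoint is a hypothetical backend route standing in for whatever Speech-to-Text service you choose.

```typescript
// A minimal sketch of Voice-based input, assuming a hypothetical
// /api/transcribe endpoint backed by a Speech-to-Text service.
async function startVoiceInput(searchBox: HTMLInputElement): Promise<() => void> {
  // Ask for an audio-only stream; no video track is needed for Voice input.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // MediaRecorder gives us compressed audio chunks we can ship to a server.
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
  const chunks: Blob[] = [];
  recorder.ondataavailable = (event) => chunks.push(event.data);

  recorder.onstop = async () => {
    const audio = new Blob(chunks, { type: "audio/webm" });

    // Hypothetical backend route that forwards the audio to an STT API
    // and returns { text: string }.
    const response = await fetch("/api/transcribe", { method: "POST", body: audio });
    const { text } = await response.json();

    // Use the transcription as ordinary text input.
    searchBox.value = text;

    // Release the microphone when we are done.
    stream.getTracks().forEach((track) => track.stop());
  };

  recorder.start();
  // Return a stop function the UI can call when the mic icon is released.
  return () => recorder.stop();
}
```

From here, the more complex "map Voice to actions" scenarios typically pass that transcription through an LLM or intent engine rather than dropping it into a text box.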
2. Voice for Accessibility
Voice is not just a way to handle input more efficiently. For blind and low-vision users, it is the best way to interface with a web or mobile application. In the past, accessibility for this community was primarily based on screen reader tools that put a voice to written content. But screen readers can only go so far, and they rely on web developers following accessibility best practices. Combining Voice interfaces through WebRTC with Generative AI offers a new path.
This really became clear to me during one of my favorite episodes of my monthly WebRTC Live webinar. In Episode #87 (January 2024), I hosted a panel discussion on Accessibility in Communication Applications. I spoke with representatives of Wordcab (Automatic Speech Recognition), SignTime (Sign Language with Digital Avatars), and Be My Eyes (App for blind and low-vision individuals to provide assistance in real-world scenarios).
The episode included a demo in which Voice interfaces and Generative AI were combined with video capture to assist a blind user in reading labels in a grocery store. This is a powerful example of how WebRTC video and audio can be combined with Generative AI to provide Voice-based interfaces that make a significant improvement in someone's quality of life.
Here’s a clip of the episode showing that demo:
It’s easy to think of WebRTC as just a way to build Video meeting tools. But there’s so much possibility for WebRTC to help disadvantaged communities better access information and interact in the physical world, using video or audio capabilities depending on the need.
In the Scaling Tech Podcast episode mentioned earlier, Tobias also shared a moving story about building an eye-tracking app that helped a hospital patient "select" words with her eyes. Instead of making her spell out every word on a virtual keyboard, Generative AI suggests the most likely words or phrases she would say next, enabling much faster communication.
3. Voice and Audio-First Applications
Continuing the theme of "more than a meeting tool technology," what about applications where Audio is the point, and where you might not use Video at all? Consider a use case like gaming. A gamer's vision is focused on playing the game, but they also like to interact with other players. While this is often done via text chat, it can also happen over audio channels that are always on, or that open only when a user activates their microphone. Audio requires much less bandwidth than Video, which is beneficial in a scenario like gaming where users are distributed across a wide range of network conditions and devices. It's also just fun!
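That "only when a user activates their microphone" pattern, better known as push-to-talk, is straightforward on top of WebRTC, because toggling a track's enabled flag mutes the sender without any renegotiation. A minimal sketch, assuming an existing RTCPeerConnection:

```typescript
// Push-to-talk: mute/unmute the outgoing audio by toggling track.enabled.
// Disabled tracks send silence, so no renegotiation is needed.
function setupPushToTalk(pc: RTCPeerConnection, talkButton: HTMLButtonElement) {
  const audioSenders = pc
    .getSenders()
    .filter((sender) => sender.track?.kind === "audio");

  const setTransmitting = (on: boolean) => {
    for (const sender of audioSenders) {
      if (sender.track) sender.track.enabled = on;
    }
  };

  setTransmitting(false); // start muted
  talkButton.addEventListener("mousedown", () => setTransmitting(true));
  talkButton.addEventListener("mouseup", () => setTransmitting(false));
}
```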
There are many other use cases where cameras might be off, or just not the focus. Watch parties, virtual karaoke, and music collaboration are fun examples. There can also be more serious examples, such as medical dictation and note-taking applications that automatically provide transcriptions and summaries of medical visits, allowing the medical provider to stay focused on the patient.
In another recent episode of WebRTC Live, I saw some interesting demos of Audio-only applications that can be built on top of browser-based audio. In Episode #91 (May 2024), I spoke with Jim Rand and James Meier from Synervoz, who showed me a demo of their Switchboard SDK. Switchboard allows developers to feed audio streams into an audio processing pipeline from a variety of sources, including WebRTC. Once in the pipeline, you can apply effects to the voices or other audio for some interesting applications that you could not build with a typical meeting tool API.
In this clip, Jim showed us how different guitar effects can be applied in the browser in real-time to an audio stream from a guitar player.
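The Switchboard SDK has its own API, but the underlying idea can be sketched with the standard Web Audio API: route a live input through an effect node before it reaches the speakers or an outgoing WebRTC track. Here is an illustrative (not Switchboard-specific) distortion effect:

```typescript
// Route a live instrument through a wave-shaping distortion, producing a
// MediaStream that can be played locally or added to an RTCPeerConnection.
async function applyDistortion(): Promise<MediaStream> {
  const ctx = new AudioContext();
  const input = await navigator.mediaDevices.getUserMedia({
    // Disable the browser's voice processing, which would mangle an instrument.
    audio: { echoCancellation: false, noiseSuppression: false, autoGainControl: false },
  });

  const source = ctx.createMediaStreamSource(input);

  // A simple soft-clipping curve; one of many possible effects.
  const shaper = ctx.createWaveShaper();
  const curve = new Float32Array(256);
  for (let i = 0; i < curve.length; i++) {
    const x = (i / (curve.length - 1)) * 2 - 1;
    curve[i] = Math.tanh(3 * x);
  }
  shaper.curve = curve;

  // Destination node whose .stream can feed a WebRTC track.
  const destination = ctx.createMediaStreamDestination();
  source.connect(shaper).connect(destination);
  return destination.stream;
}
```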
4. Voice in the Contact Center
Contact center applications, often Voice-first, are a big part of our work at WebRTC.ventures. Most customer service applications are still telephony-based and Audio-only. Using tools like Amazon Connect, we help traditional on-premise contact centers move to the cloud, though this first migration is often still telephony-based.
WebRTC does allow for the incorporation of additional communication channels into the contact center. The most obvious reason for implementing Video into the contact center is to allow for sign language interpretation. We’ve done this for multiple clients in use cases like hospitals and legal interpretation. Other use cases for Video in the contact center involve things like insurance claims and field service applications where it’s helpful for the agent to see what the customer is seeing.
Still, most contact center implementations are predominantly Audio-only, and WebRTC still plays a vital role. The WebRTC audio channel allows "click to call" to be built into websites and mobile apps, so users can connect directly with the contact center without dialing a phone. The benefit is two-fold: customers get the convenience of contacting customer service from the same app where they may be having trouble, while the contact center can use a single customer contact solution regardless of how the customer reaches them.
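The browser side of "click to call" can be as small as the sketch below. The signaling step is deliberately hand-waved: sendToSignalingServer is a hypothetical helper, since real deployments exchange the offer and answer over a WebSocket server or a CPaaS SDK.

```typescript
// A minimal audio-only "click to call" from a web page.
async function clickToCall(
  sendToSignalingServer: (offer: RTCSessionDescriptionInit) => void
): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // Audio only: no camera permission prompt, minimal bandwidth.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // Play the agent's audio when it arrives.
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0];
    audio.play();
  };

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToSignalingServer(offer); // answer handling omitted for brevity
  return pc;
}
```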
Beyond incorporating different forms of “in-app calling” into the contact center, modern technologies allow for much more. I explored this with Dan Nordale from Symbl.ai in WebRTC Live #88 in February 2024. We talked about many different use cases for LLMs in WebRTC-based applications, such as call transcription, sentiment analysis, and keyword detection.
LLMs in a Voice application can go well beyond recording- and summarization-style uses; they can become active participants. Dan and I talked about how combining automation and human agents can improve customer satisfaction. In an "agent assist" scenario, the LLM supplies the agent with additional information to help solve the customer's issue.
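In practice, agent assist is usually a loop: transcribe the live call, feed the running transcript to an LLM, and surface its suggestions to the human agent. A hedged sketch, where /api/assist is a hypothetical backend route rather than any specific vendor's API:

```typescript
// Each finalized transcript segment is appended to the history, and the
// LLM proposes the agent's next response. Suggestions are shown in the
// agent's UI, never sent to the customer automatically.
interface TranscriptSegment {
  speaker: "customer" | "agent";
  text: string;
}

async function suggestReply(history: TranscriptSegment[]): Promise<string> {
  const response = await fetch("/api/assist", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt: "Suggest the agent's next response to this support call.",
      transcript: history,
    }),
  });
  const { suggestion } = await response.json();
  return suggestion;
}
```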
Bonus: The Importance of Audio Processing in WebRTC
WebRTC presentations often focus on video quality, which matters most in a Video-first application. But in many of the use cases I've described in this post, audio quality matters more. That makes it essential to understand how audio processing is done in WebRTC, and how different standards, frameworks, and third-party tools can come together to improve the audio in a WebRTC application.
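A good starting point is knowing that the browser's built-in audio processing is controlled through getUserMedia constraints. These three are standard and widely supported:

```typescript
// Request a microphone with the browser's standard voice processing enabled.
async function getProcessedMicrophone(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true, // remove far-end audio picked up by the mic
      noiseSuppression: true, // attenuate steady background noise
      autoGainControl: true,  // keep speech at a consistent level
    },
  });
}
```

For speech use cases you generally want all three on; for the music and instrument scenarios in section 3, you would turn them off, as in the earlier guitar sketch.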
In one of my monthly WebRTC industry chats with Tsahi Levent-Levi, we talked about WebRTC, machine learning and audio processing.
The Ongoing Importance of Voice in WebRTC Applications
WebRTC is about so much more than meeting tools and video applications. As you’ve seen in this range of use cases for Voice-first applications using WebRTC, there’s a lot of value we can provide if we’re creative and have the right range of knowledge and experience.
Today’s ideal WebRTC Developer has experience with Video technologies and standards, audio engineering, and web and mobile development; an understanding of Generative AI; and an eye and ear for video production. That’s a lot to find in a single person. Finding it across a team is what will enable you to build the most innovative and valuable applications based on WebRTC.
Our team at WebRTC.ventures has all of these skills, plus the experience of applying them across many different use cases. In those rare cases when we don’t have direct experience ourselves, we have partners and friends across the industry who do!
How can we help with your WebRTC project? We’d love to hear from you!