On the October 16, 2024 episode of WebRTC Live, host Arin Sime welcomed Rob Pickering, CEO of Aplisay, to explore the fast-changing landscape of conversational AI and streaming speech.
Bonus Content
- Our regular monthly industry chat with Tsahi Levent-Levi.
Watch Episode 95!
Key Insights
⚡ We’re slowly moving to more fluid, natural conversations with AI. Instead of us constantly adapting to AI, these systems are beginning to adapt to our way of speaking. In this episode of WebRTC Live, Rob talks about how speech-to-speech models will facilitate this evolution. He explains, “If we go back to the Alexa time period, we all got an Alexa, and we sort of trained ourselves to talk to it the way it wanted to be talked to. So we did that walkie-talkie thing of, don’t interrupt because that doesn’t work too well, let her finish, and then when she’s finished, tell her where she’s gone wrong, and work out the 10 trigger words apart from Alexa or Hey, Google. […] What’s then happened really since that mid-2023 period is a couple of things. One is that we’ve got better and better at optimizing that pipeline: newer, quicker speech-to-text with a much shorter latency, and streaming so that we can start picking out some of those words in the transcription much earlier and, while I’m picking them out, sending them off to a large language model.”
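The streaming optimization Rob describes can be sketched roughly as follows. This is a minimal, self-contained illustration, not any vendor's API: the transcription feed and the LLM client are simulated, and the point is simply that model work begins on partial transcripts instead of waiting for the caller to finish speaking.

```python
import asyncio

# Minimal sketch (simulated components, no real SDK) of the streaming idea:
# interim speech-to-text fragments are forwarded to the language model as
# they arrive, so LLM work starts before the caller has finished speaking.

async def stt_partials():
    """Simulated streaming transcription: yields interim fragments."""
    for fragment in ["book me", " a table", " for two", " at eight"]:
        await asyncio.sleep(0.2)   # stand-in for real-time audio arriving
        yield fragment

class FakeLLM:
    """Hypothetical LLM client, used only to illustrate early dispatch."""
    def prime(self, partial_text: str) -> None:
        print(f"priming model with partial: {partial_text!r}")

    async def complete(self, text: str) -> str:
        await asyncio.sleep(0.3)   # stand-in for model latency
        return f"Booked: {text}"

async def handle_turn() -> str:
    llm, transcript = FakeLLM(), ""
    async for fragment in stt_partials():
        transcript += fragment
        llm.prime(transcript)      # start work early on the partial text
    return await llm.complete(transcript)

print(asyncio.run(handle_turn()))
```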
⚡ The best thing about AI is its flexibility. AI is an incredible tool that we’re lucky to have at our disposal. Its adaptability allows you to tailor it to your own needs and circumstances. Rob says, “The brilliant thing is […] you build an agent definition once and use it on all those platforms. There are plenty of other people who are doing this sort of stuff, and it’s not particularly hard to go build against the different environments anyway. So, I don’t think I would recommend anyone today go off and target a pipeline if they’re starting a new development. At the very least, make your stuff run under a speech-to-speech model as well. And if costs are an issue, wait for them to come down.”
⚡ The impressive capabilities of AI come with a significant cost. Despite AI’s massive potential, the associated costs may pose a barrier for many users. Rob says, “The capabilities of some of these models are fantastic. The only thing that I am going to say at this point, because it actually kind of needs saying, is that if you look at the costs involved in some of these multimodal models, they are utterly breathtaking. So the OpenAI model is currently basically 100 dollars per million tokens input and 200 dollars per million tokens output, which equates to about six cents a minute for voice input and about 24 cents a minute for voice output in a typical conversation. So you’re looking at dollars and dollars an hour, 20, 30 dollars an hour, for an AI agent, which clearly is going to limit the use cases of this stuff substantially. Is that where it’s going to stay? No. Of course, it isn’t.”
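Rob's per-minute and per-hour figures follow directly from the quoted token prices once you assume how many audio tokens a minute of speech consumes. The token-per-minute rates in this back-of-the-envelope check are assumptions implied by his numbers, not published specifications.

```python
# Back-of-the-envelope check of Rob's figures. The audio-token rates below
# are assumptions back-calculated from his per-minute costs, not published specs.

INPUT_PRICE_PER_TOKEN = 100 / 1_000_000    # $100 per million input tokens
OUTPUT_PRICE_PER_TOKEN = 200 / 1_000_000   # $200 per million output tokens

INPUT_TOKENS_PER_MIN = 600      # assumed audio tokens per minute of speech heard
OUTPUT_TOKENS_PER_MIN = 1_200   # assumed audio tokens per minute of speech produced

input_cost_per_min = INPUT_TOKENS_PER_MIN * INPUT_PRICE_PER_TOKEN      # ≈ $0.06/min
output_cost_per_min = OUTPUT_TOKENS_PER_MIN * OUTPUT_PRICE_PER_TOKEN   # ≈ $0.24/min

# Simplistic hourly total if every minute incurs both rates:
hourly = 60 * (input_cost_per_min + output_cost_per_min)
print(f"~${hourly:.0f} per agent-hour")   # ≈ $18, in the ballpark of Rob's 20-30 dollars an hour
```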
Episode Highlights
New breakthroughs in speech technology have made a significant impact on the industry.
The latest advancements in speech-to-speech models remove the need for cumbersome speech-to-text conversions, raising important questions about the future of AI development. Rob says,
“We all started building this stuff about a year and a half ago, and we’ve got really quite good at some of these pipelines. If you look at some of the newest speech-to-text engines, we’re starting to get latency down, and this thing is feeling a little bit less like a walkie-talkie conversation, and we’re feeling a little bit good about ourselves because we’ve managed to do all of this. And then OpenAI did what they called the spring update, which I think was kind of roundabout the start of June, end of May, where they did this fantastic demo of the speech-to-speech model, and we were all blown away, because what that means is they’re tokenizing the speech directly into the large language model, so all of this clunky speech-to-text and then text back out again goes away. And ever since then, we’ve been sitting there thinking, what does this mean for how we’re going to build this stuff?”
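As a rough way to see why removing the intermediate hops matters, here is a schematic comparison of the two architectures Rob contrasts. The stage names and latency figures are illustrative assumptions, not measurements of any specific model or service.

```python
# Schematic comparison of a cascaded STT -> LLM -> TTS pipeline versus a
# speech-to-speech model. Latency values are illustrative assumptions only.

def simulate(stages: dict[str, float]) -> float:
    """Sum the serial latency of one conversational turn across the given stages."""
    total = 0.0
    for name, seconds in stages.items():
        total += seconds
        print(f"  {name:<22} +{seconds:.2f}s")
    return total

print("Classic cascaded pipeline:")
cascaded = simulate({
    "speech-to-text": 0.40,
    "LLM completion": 0.80,
    "text-to-speech": 0.35,
})

print("Speech-to-speech model (audio tokenized directly):")
direct = simulate({
    "audio in -> audio out": 0.70,
})

print(f"Illustrative saving per turn: {cascaded - direct:.2f}s")
```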
Speech-to-speech technology aims to improve the quality and efficiency of human-AI conversations.
The primary goal of speech-to-speech technology is to minimize latency and create more seamless interactions. Rob says, “You can see there that it’s got the same sort of very short latency, or much better than very short latency; we’re actually getting real completions back, all the way through to speech from the LLM, right away.”
Arin adds, “No more okays in the middle.” Rob continues, “Suddenly that feels like we’re talking to something that’s being really quite responsive. I’m not sure about the whole idea of trying to persuade people they’re talking to another human. I don’t think that matters. I think what really matters is that we’re not frustrating people because the conversation is kind of slower.”
The video-to-video model has huge potential for improving conversations.
Video-to-video modality represents another exciting advancement in AI technology. Rob explores its immense potential and highlights the main benefits of this innovation. He says,
“All of a sudden, when you start being able to add video to that, because there’s body language, that starts to get to a point where the AI can become, and I hate to use sort of anthropomorphic words, but perceptive. It’s able to perceive a lot more about what’s going on in the conversation, just the way that we do if we’re in a noisy room and we kind of figure out where the conversation is, even if we’re losing 50 percent of it. So, I think for that reason, we’ll end up doing it. That’s the most serious use case I can see for it. There are all sorts of other use cases around sign language and impairments and all of that sort of stuff. So I hope that the use case we come up with for video is enhancing conversations rather than doing wild and evil stuff like impersonating people.”
Up Next! WebRTC Live Episode 96
Call Quality at Scale – Balancing Automated Monitoring and the Human Factor with Luca Pradovera of SignalWire
Wednesday, November 13, 2024 at 12:30 pm Eastern.