Going Multimodal

Part one in a two-part series on the intersection of Voice and Conversational AI technology with Video applications. Read part two.

In the business world, we often talk about breaking down silos. For example, any class on data structures will talk about the noble goal of integrating different data silos – evoking imagery of thousands of old spreadsheets finding their way to a promised land of data lakes and synthesized queries across disparate data sets, all leading to stunning business outcomes that a humble relational database designer could never have imagined.

There’s a similar pitch going on in Customer Experience. Here, the pot of gold at the rainbow’s end is Conversational Artificial Intelligence. I went to the Voice22 industry conference and trade event in Arlington, Virginia last week to drink the Kool-Aid. After a strong draw from the source, I’m here to say that I’m very much sold on Voice and Conversational AI. But I’m even more intrigued by the intersections of Voice and Video.

In this two-part blog post, I’ll start with background and definitions, summarize some learnings, and mention interesting companies I met at Voice22. In the second part, I’ll conclude with thoughts on how Video, Voice and Bots are merging in customer interactions.

To be clear, Voice22 is about, well, Voice. It’s not really about Video. Here at WebRTC.ventures, we’re about Video, specifically live video applications. We do also care about Voice, and I think there are some silos that need to be broken down between the two. So, let’s get started!

A quick primer on what came before Conversational AI

Conversational AI, sometimes further shortened to CAI, is about creating a customer experience that is less like an FAQ page. Instead, it flows more naturally, the way that a conversation between a customer service agent and a customer might if they were speaking in person or on a phone call. From here on, I’ll refer to a live human customer service agent as an Agent, and to the real human asking for help as the Customer.

The problem with the traditional way of doing customer service is that it’s expensive, it’s slow, and it’s not always helpful. No human can be trained to know everything. It’s too expensive for a company to hire a surplus of Agents, so Customers may wait on hold for long periods of time. Once the Customer does speak with an Agent, the Agent may not be able to readily answer their question. They may have to transfer the Customer to another Agent – likely with another expensive and frustrating wait.

In a traditional phone call, companies first tackled this challenge with push-button menus or Interactive Voice Response (IVR). But it is frustrating for the Customer to navigate a menu system that they did not design and that may not offer a simple path for their needs.

On websites and in apps, companies have tried to improve on IVR by adding in text-based chat bots. Initially, these just functioned like an FAQ page, with menu-driven interfaces for Customers to navigate. Then they evolved into more free-form text entry, where the Customer is prompted with an open statement like “How can I help you?” A backend system tries to parse out the key words in the Customer’s text question, and then often presents text driven results that look, once again, like an FAQ page.

A sidebar on Unified Communications

Before big enterprises figured out how to make Conversational AI, a different term was being thrown around a lot: Unified Communications. This is corporate speak for allowing a Customer to speak with an Agent via text chat, WhatsApp, SMS text messaging, Voice calls (either VOIP or traditional telephony), and any other messaging app they could find an interface to. A related term is Omnichannel communications, where again customers can interact with a business seamlessly across multiple channels.

In the fancier implementations of Unified Communications, a Customer could continue a seamless conversation with an Agent or a Bot across any of those different communication channels, even as they switch between those channels. The context and details of the conversation would carry along with them.
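To make that idea a bit more concrete, here is a minimal Python sketch of how context might carry along as a Customer switches channels. This is not how any particular vendor does it. The names `Conversation` and `ConversationStore`, the customer ID, and the in-memory dictionary are all things I made up for illustration. The one assumption that matters is that the conversation is keyed by the Customer rather than by the channel.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Everything said so far, regardless of which channel it arrived on."""
    customer_id: str
    # Each entry is a (channel, sender, text) tuple.
    messages: list = field(default_factory=list)

    def channels_used(self):
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return list(dict.fromkeys(channel for channel, _, _ in self.messages))


class ConversationStore:
    """Hypothetical in-memory store, keyed by Customer instead of by channel."""

    def __init__(self):
        self._by_customer = {}

    def record(self, customer_id, channel, sender, text):
        # Reuse the existing conversation if this Customer has one already.
        if customer_id not in self._by_customer:
            self._by_customer[customer_id] = Conversation(customer_id)
        convo = self._by_customer[customer_id]
        convo.messages.append((channel, sender, text))
        return convo


store = ConversationStore()
store.record("cust-42", "whatsapp", "customer", "Where is my order?")
store.record("cust-42", "whatsapp", "agent", "Let me check on that.")
# The Customer switches to SMS, and the earlier history is still attached.
convo = store.record("cust-42", "sms", "customer", "Any update?")
```

Because the SMS message lands in the same `Conversation` as the WhatsApp messages, an Agent or a Bot picking up the thread would see the whole history. A real system would also have to work out that the WhatsApp user and the SMS sender are the same person, which this sketch simply assumes.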

Although I’ve never personally been in a customer service situation where I had to frantically switch from a WhatsApp message to a Facebook Messenger chat to an SMS text chat as I crossed various firewalls or signal coverage, this is one of the scenarios offered by Unified Communications. We demonstrated the concepts ourselves in a telehealth hackathon project at a TADHack event way back in 2016, and in 2021 I wrote a post about Omnichannel communications using the Vonage Conversation API.

Me at the @Vonage booth at @VoiceSummitAI with @timholve and @dianasoyster

If you want to use a single communication API for many different types of communication modalities like text chat, SMS, social media, and voice, that is definitely something we at WebRTC.ventures can help you with. We partner with companies like Amazon, Vonage and Twilio and can integrate their APIs into your application.

The term Unified Communications has always sounded a bit too enterprise-y to me. While these are still very important concepts, when compared to Conversational AI concepts, they are only the plumbing through which a conversation may travel.

Text Bots evolve towards Conversational AI

The next step has been trying to make those text bots understand and talk more like a human. This is where we can start to use the term Conversational AI. The technical backend is not the sort of full-fledged AI you see in movies. Oftentimes it is still largely based on the company configuring a tool with many keywords and adjectives, trying to anticipate the types of questions and terms a Customer will use when asking a question.

The Conversational AI tool or API is then smart enough to be able to expand somewhat on what the company has anticipated Customers might ask, and provide an intelligent answer supplied by the company. The company can use a low-code Conversational AI solution to do all this parsing across different communication methods (text chat, SMS, a voice call, etc), and then connect that application to their own internal corporate APIs in order to retrieve Customer information like order status, shipping details, etc. Or, the system can route the Customer to an Agent for further discussion.
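Here is a rough Python sketch of that flow, just to show its shape. Everything in it is hypothetical: the `INTENTS` keyword table, the `lookup_order_status` and `lookup_shipping` functions standing in for internal corporate APIs, and the simple keyword-overlap scoring. Real Conversational AI tools use far more capable language understanding than this, so treat it as an illustration of the routing idea only.

```python
import re

# Hypothetical intents a company might configure, each with trigger keywords.
INTENTS = {
    "order_status": {"order", "status", "tracking", "package"},
    "shipping": {"shipping", "delivery", "ship"},
}


def lookup_order_status(customer_id):
    """Stand-in for a call to an internal corporate API."""
    return f"Order for {customer_id} is out for delivery."


def lookup_shipping(customer_id):
    """Stand-in for a call to an internal corporate API."""
    return f"Standard shipping for {customer_id} takes 3 to 5 days."


HANDLERS = {
    "order_status": lookup_order_status,
    "shipping": lookup_shipping,
}


def detect_intent(text):
    """Pick the intent whose keywords overlap most with the Customer's words."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    best, best_hits = None, 0
    for intent, keywords in INTENTS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = intent, hits
    return best  # None when nothing matched


def handle(customer_id, text):
    """Answer from the Bot when an intent matches, otherwise route to an Agent."""
    handler = HANDLERS.get(detect_intent(text))
    if handler is None:
        return ("agent", "Let me connect you with an Agent.")
    return ("bot", handler(customer_id))


matched = handle("cust-42", "Where is my package?")
unmatched = handle("cust-42", "I want to talk about my bill")
```

The first message matches the `order_status` keywords, so the Bot answers it from the stubbed API. The second matches nothing the company anticipated, so it falls through to an Agent.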

A good example of this type of low-code Conversational AI is the AI Studio by Vonage. We have previously blogged about integrating AI Studio with an example telehealth scenario. While people sometimes refer to these as “no-code” solutions, I always think of them as “low-code” instead because of the necessity of connecting them to your corporate APIs to gather that custom data. You’re almost certainly going to still need a technical team to help you with this. And that team will still need some of those magical and hard-to-find unicorns we refer to as software developers or programmers. Of course, I would selfishly say that because that’s what we do at WebRTC.ventures.

Me and the @symbldotai team at @VoiceSummitAI – they have a great API for conversational intelligence that can be built into video apps like we build for our clients at @WebRTCVentures. We wrote a blog post showing how to integrate Symbl.ai real time transcription with the Vonage Video API – it worked great!

All of this leads to significant cost savings for the company, which may not need to staff so many Agents at once. It is also less frustrating for the Customer, who gets their responses faster.

The state of Conversational AI at Voice22

I attended the Voice22 conference as a way to better understand the current state of Conversational AI. It’s no surprise that things have come a long way since the last conference I attended before the pandemic.

For one thing, “Voice” and “Conversations” are widely encompassing terms now. They could refer to a Voice-bot that speaks to you over a traditional phone call, a text-based chat bot, or even a photo-realistic avatar like Ericsson’s Digital Human project. They also include devices using Voice-bots like Amazon Alexa, which was well represented at the conference, as well as Apple’s Siri and Google Assistant.

These voices are being used everywhere, not just in the device you bought for your kitchen counter. Jeff McMahon from Voicify talked about how Voice is transforming 100-year-old companies and industries. Examples ranged from voice assistants in your car all the way to restaurants addressing labor shortages by having an intelligent answering service answer common questions and take reservations.

From here on, I’ll refer to Bots generically, regardless of which device you may be communicating over, if you’re using your human voice and hearing a response, or typing on a keyboard and reading responses. A big selling point of most of these solutions is that they support all those communication modes. Since they can integrate with many different types of scenarios, the focus is on providing a good conversational customer interaction regardless of the form of communication.

No more shitbots

Beyond which type of Bot you’re interacting with, it was clear to me from attending Voice22 just how much better these Bots are than they used to be. In his book “Age of Invisible Machines”, Robb Wilson writes about his desire to “lift users and organizations out of the seemingly infinite shitbot doldrums.” (Shout out to Wilson’s team at OneReach.ai for giving everyone at the Voice22 conference a copy of the book, which I’m enjoying on a plane as I write this.)

A Bot no longer has to be a shitbot or simply an FAQ page. Conversational AI technology has progressed so far that if it weren’t for a few awkward pauses here and there, you’d have a hard time distinguishing between a human Agent and a Bot, even on a telephone call.

The technology has progressed well beyond dictionaries of common words combined with term parsing to help figure out what a Customer may be asking. Surveys show that Customers are beginning to trust Bots more, as James Poulter from Vixen Labs pointed out in their Voice Consumer Index, further adding that Voice app usage is going up 7% year-over-year.

Empathy and Voice Quality

The quality of the voices themselves is also really remarkable now. When using a generic Conversational AI tool and Text-to-Speech, you may still encounter robotic voices. There’s an argument to be made that this is ok if you want the Customer to understand that they are speaking with a Bot and not an Agent. But robotic voices will become less common as the technology continues to improve and companies want a Bot that is more empathetic to their Customers’ concerns and can speak more like them. As Tarren Corbett-Drummond from Ericsson said in a panel discussion on the future of Conversational AI, “the more human your digital interface is, the more forgiving a customer will be.” This pays dividends if the Bot does have to ask the Customer to repeat things, or when the Customer is left with a resolution which is not totally satisfactory.

As an example of both empathy and incredible voice quality, Nikola Mrkšić from Poly.ai demonstrated a localized Voice Bot for a funeral home. The Bot was able to understand an elderly caller with a thick accent and displayed appropriate empathy when the caller discussed a relative who had died. The Bot itself was using an accent that was localized to the region of the funeral home. It was a remarkable conversation! Companies like Poly.ai and others will create custom Bot voices that best represent your brand or that your Customer will best relate to.

Synthetic Voices

These custom voices are referred to as “Synthetic Voices”. Ryan Steelberg from Veritone discussed the use of Synthetic celebrity voices in a very interesting keynote. This goes way beyond that time I changed my Waze app to have Christina Aguilera give me directions and I couldn’t figure out how to have her stop telling me the same story about life being more about the journey than the destination.

Part of Veritone’s work is the licensing and use of celebrity voices. Beyond the surprisingly interesting discussion of how a family estate can license the voice of a dead celebrity (maybe if my Scaling Tech Podcast or WebRTC Live takes off, my family will have to deal with this? LOL), there are many reasons a living or dead celebrity voice might be used.

The reality is you’re probably listening to synthetic voices now and don’t realize it. Ryan talked about how national radio stations will produce local weather reports using the synthetic voice of a national radio host because they don’t have the time to do something that personal across their vast corporate radio networks. While I’m personally a big fan of local community radio, it’s very interesting to hear just how high quality these weather reports are. They sound exactly like the real person’s voice. This same approach is used prolifically by other companies like DraftKings. Veritone’s booth showed an impressive demo of synthetic voice commentary for a soccer game based on a statistics database integration.

One more thought I have to share from the Voice22 sessions is about the concept of “multi-modal”. The term was used frequently by speakers and the most common example given is how a Bot may communicate with you in multiple modalities during the same conversation. This is different from the Unified Communications concept I discussed earlier.

Imagine that you are using an Amazon Echo Show device on your kitchen counter and you ask Alexa to tell you about the weather. It will show you images while also reading the weather report to you. Or perhaps while explaining the next recipe step to you, it shows you a video of how to fold in the cheese while mixing.

Multimodal is a good segue to my point about breaking down barriers between Voice and Video because it reminds us to think beyond the direct use case or communication channel that our Bot is using with our Customer. Just because the Bot is text chat based doesn’t mean it can only communicate with the Customer via text. Playing an audio file or showing a video may help to explain something better. We should consider adding in that modality in order to create the best customer experience.
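For the developers reading along, one way to picture a multimodal reply is as a single response that carries several optional parts. The Python sketch below is only an illustration under my own assumptions. `BotReply`, `parts_for`, and the example.com URL are invented names, not part of any product mentioned in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotReply:
    """One reply from the Bot, with optional extra modalities attached."""
    text: str
    audio_url: Optional[str] = None
    video_url: Optional[str] = None

    def modalities(self):
        # Text is always present; audio and video only when a URL is set.
        present = ["text"]
        if self.audio_url:
            present.append("audio")
        if self.video_url:
            present.append("video")
        return present


def parts_for(reply, supported):
    """Keep only the modalities that the Customer's device can present."""
    return [m for m in reply.modalities() if m in supported]


reply = BotReply(
    text="Fold in the cheese gently while mixing.",
    video_url="https://example.com/fold-in-the-cheese.mp4",  # placeholder URL
)

# A device with a screen gets the text and the video.
on_screen_device = parts_for(reply, {"text", "audio", "video"})
# A plain text chat window falls back to the text alone.
in_text_chat = parts_for(reply, {"text"})
```

The point of the design is that the Bot decides what would explain things best, and the channel decides what can actually be shown.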

The future of Voice and Video in Customer Experience

At Voice22, there was a lot of agreement that the interface of the future is Voice. And that Voice you are communicating with may not be a real human in the moment, but some form of Synthetic Voice and Conversational AI. In one keynote, “optimistic futurist” Ian Utile predicted that our grandchildren will not even know how to use a keyboard because voice interfaces will be the only way they are used to communicating with machines.

If the future is Voice, where does this leave Video?

That’s what I explore in part two in this series. We talk about why and when human Agents still matter (spoiler alert – they do!), how to connect Voice with Video, and explore another concept that I learned about at Voice22 – Hyperautomation.

The best ways to keep up with our blog posts at WebRTC.ventures are to follow us on Twitter and join our email list! And when you’re ready to build your own innovative application at the intersection of Voice and Video, then contact our experts!