On January 10, 2024, Arin Sime welcomed a panel of experts, each with a unique perspective on meeting the challenge of accessibility in communication applications.
While web applications have consistently prioritized accessibility, conversations around accessibility in video applications often start and end with captions for the hearing impaired. With the emergence of AI services capable of real-time audio and video processing, we have the potential to enhance accessibility in communication applications significantly.
Connecting blind and low-vision individuals with sighted volunteers in video conversations
Be My Eyes is a mobile app that connects blind and low-vision individuals with sighted volunteers who help them over a video conversation. In 2023, they launched Be My AI™, a state-of-the-art visual assistance tool powered by OpenAI's GPT-4 vision model. Our guest, Be My Eyes Vice Chair Bryan Bashin, shared a remarkable demo from a TikTok influencer using it at a grocery store. They also recently launched Be My Eyes Groups, where individuals who are blind or have low vision can build their own custom network of people to help with everyday tasks, like reviewing paper mail or reading personal documents such as financial statements, bank communications, or medical prescriptions.
Translating Text into Sign Language with Digital Avatars
Sign Time GmbH's SiMAX software translates text into 3D animated sign language reproduced by digital avatars. CEO Georg Tschare showed a demo and detailed a host of use cases where SiMAX can assist deaf people in accessing information that would normally be unavailable to them, including videos, films, web content, medicine packets, museum information, and much more. For some use cases, such as loudspeaker announcements in public transport systems, it can translate automatically and in real time. They expect that in the coming years they will be able to translate everything fully automatically.
Automatic Speech Recognition (ASR)
Wordcab is a conversation intelligence suite that leverages AI to transcribe speech and generate qualitative insights. Their speech-to-text API can transcribe speech in 57 languages, from Afrikaans to Welsh, making it possible for people with hearing impairments to access information presented verbally. Founder and CEO Aleks Smechov shared a demo of their upcoming consumer product, a super-charged transcript editor called Wordflo (launching January 30) that is designed to be the fastest and most accurate editing experience ever, and that can also be used for closed captions.
WebRTC.ventures’ work building accessible applications
Arin shared that an interesting aspect of WebRTC.ventures' work building video applications is that we get to be a part of improving accessibility, yet too often software applications are not built with accessibility in mind. He spoke about non-video-specific projects for US state and federal agencies that must ensure accessibility in compliance with Section 508, covering areas such as screen readability, descriptive tags, and contrasting color choices for low vision. He also described video projects such as the Aira.io Visual Interpreting Service, with on-demand trained interpreters, and building video into call centers for sign language interpretation and language translation scenarios.
For more detail on these topics and more:
Watch Episode 87!
Up Next! WebRTC Live Episode 88
with Dan Nordale of Symbl.ai
Wednesday, February 21 at 12:30 pm Eastern
Episode Transcript
The following transcript was produced using Wordcab.
[00:00:11 -> 00:01:40] Arin Sime: Welcome back to WebRTC Live. I'm your host, Arin Sime, founder and CEO of WebRTC Ventures. WebRTC Ventures is a custom design and development agency focused on building live applications. We're here to help you take your application live. You can learn more about us at WebRTC.Ventures. Thanks for joining us live today on YouTube, LinkedIn, and Twitch. Welcome also to any of you watching a replay of today's event. For those of you with us live, you can ask questions throughout the stream by leaving a comment in whichever platform you're viewing this on. So for today's conversation, I am joined by three excellent panelists, all with a unique perspective on accessibility in software. So let me give you a brief introduction to each of them, and then they will share a little bit more about their work. Starting with Bryan Bashin, Vice Chair of Be My Eyes, which is a free mobile app with one main goal: to make the world more accessible for blind and low vision people. The app connects blind and low vision individuals with sighted volunteers and companies from all over the world through a live video call. Bryan joins us from California. Thanks for being here today, Bryan.
[00:01:40 -> 00:01:42] Bryan Bashin: It’s a pleasure to join, Arin.
[00:01:43 -> 00:02:08] Arin Sime: It's wonderful to have you. Thank you. Next up, we have Georg Tschare, CEO of Sign Time. Sign Time's product, SiMAX, is software for translating text into 3D animated sign language. A learning database in the background and sign language professionals ensure high quality translation, which is reproduced by a digital avatar. Georg joins us from Austria. Thanks for being here today, Georg.
[00:02:08 -> 00:02:09] Georg Tschare: Thanks for the invitation.
[00:02:10 -> 00:02:37] Arin Sime: It's wonderful to have you here. Thank you. And finally, last but certainly not least, Aleks Smechov, co-founder and CEO at Wordcab. Wordcab's end-to-end AI suite enables teams to make sense of their unstructured communications at scale, and their team is now working on their first consumer-facing product, an AI-powered transcript editor. Aleks joins us from Pennsylvania, snowy Pennsylvania, in fact. Welcome, Aleks.
[00:02:36 -> 00:03:36] Aleks Smechov: Nice to be here and excited for the talk.
[00:02:36 -> 00:03:36] Arin Sime: Absolutely, yeah. Great to have all of you here. Thank you, everyone, for joining me today. So let's start with a little more information on each of your companies. We have a little video to play for each of you, for our viewers to get a better sense of the work that you do, and then you can talk a little bit more about it. So we'll start with Bryan Bashin and Be My Eyes. This is a video from Lucy, a TikTok influencer who's blind, giving us a little demo of using Be My Eyes, the AI product, and a really fascinating little demo here. So let's share that.
Be My Eyes Demo Transcript:
I'm in a Chinese supermarket. I'm gonna use AI to translate this label for me, and tell me what it is. Take picture, buddy, take picture. Write question here. What is this? Question mark. Virtual volunteer, hi. It appears to be a bottle of hua shao qiu, a Chinese rice wine. Is there anything else you would like to know about it?
It translated it, that’s so cool.
What recipe can I use it in? Question mark. Virtual volunteer, hua shao qiu can be used in many Chinese dishes, including stir fry, braised dishes, and soups. Some popular recipes that use this wine are drunken chicken, Shaoxing braised pork belly, and hot and sour soup. Would you like me to search for a specific recipe for you? Oh, you're so... Okay, I'm gonna ask it about the drunken chicken. Could I have the recipe for drunken chicken? Question mark. Here is a recipe for drunken chicken. Ingredients, one whole chicken, two cups of hua shao qiu,
One cup soy sauce, one cup sugar, one tablespoon ginger, mint, two cloves garlic. I’m buying this, guys, this is really cool. I’m gonna use vegetarian chicken though, don’t worry.
[00:04:19 -> 00:04:32] Arin Sime: Bryan, I love Lucy's expressions in that video and just the emotions that she's sharing with us about the impact of Be My Eyes. I've watched a few other videos that she has. Tell us a little bit more about what we just saw there.
[00:04:32 -> 00:05:17] Bryan Bashin: Yeah, it's been stunning. Well, for eight years we've had more than 7 million human volunteers able to answer questions. In March of last year, we married OpenAI with our platform, and that's what you just saw. And so many people are interested in using AI description instead of bothering somebody. Maybe they think that's too trivial a project, or maybe in this case, they want more than a random volunteer can provide, like a recipe or some kind of context. So in the first month that we set up OpenAI with our system, we had 1 million questions, 1 million queries of it from the blind community.
[00:05:17 -> 00:05:33] Arin Sime: Awesome. Well, we're going to talk more about the impact of AI on accessibility and hear more about the AI version of Be My Eyes. So let's go to Georg, and I'm going to share this video demo of sign language.
[00:05:38 -> 00:06:23] Georg Tschare: Yes, what you see here is a sign language avatar signing some information. This is a very neutral avatar we have. You see in the background how we create it. There's a lot of handwork needed for these avatar types. We have a lot of use cases. We translate videos and films. We translate package leaflets of medicines. We translate web content. We translate museum information, patient information, other information for kids, and everything else that deaf people don't have access to, because they are very limited in understanding text information and of course they cannot hear the spoken word.
[00:06:26 -> 00:06:39] Arin Sime: That's really cool. Is this primarily for recorded videos at this point, or can we feed any text to the avatars?
[00:06:39 -> 00:07:33] Georg Tschare: Oh, it depends on the use case. For some use cases, we have the technology to translate fully automatically. So we can, for example, input weather warnings or other kinds of warning information. This very highly standardized content can be translated fully automatically, as can, for example, loudspeaker announcements in public transport systems. We run such a system at the Vienna public transport company. In this case, we can translate in real time and fully automatically, but for most of the information we translate, we need a semi-automatic translation system, because natural language cannot be fully automatically translated. Not yet. We work on that, and I'm sure within the next years we will be at a point where we can really translate everything fully automatically, at very low cost and in very large amounts.
[00:07:34 -> 00:08:02] Arin Sime: Fascinating, I love it. Yeah, I think there are so many interesting applications and so much potential for avatars in communications in general, but this use specifically, for sign language and for accessibility, is really fascinating to me. So I'm definitely looking forward to learning more from you about that in this episode. Thank you, Georg. All right, Aleks, let's go next to a demo of Wordflow.
Wordflow Demo Transcript:
Welcome to Wordflow by Wordcab, where our core value proposition is to save transcription editors time getting to final perfect and dealing with difficult audio and difficult terminology. So for basic edits: he says, "technically, I do have a boring company." We're gonna go ahead and do an edit on that. I can either hit R on my keyboard or hit edit here. R, I'm gonna do a select all, "Technically, I do," return, and that should fix it.
Actually, technically, I do have a boring company.
Wordflow will also have AI features, like automatic table of contents creation and editing. We have a find and replace capability. So you select search, you have find, and then once you find words, you can strike through them or strike through all of them and then find and replace. So let's do an example of find. Let's just look for Elon Musk and go ahead and strike through all of those. This will be deleted and the timestamps will be reset as well.
[00:09:02 -> 00:09:13] Arin Sime: Aleks, tell us a little bit more about what we just saw there in the demo of Wordflow.
[00:09:13 -> 00:09:35] Aleks Smechov: Sure. Just to back up a bit, we're a voice intelligence company that creates speech-to-text and summarization tools. And we're releasing our first consumer product at the end of this month, a supercharged transcript editor called Wordflow, which is what you just saw in the video. And it's designed to be the fastest and most accurate transcript editing experience ever. It could potentially be used for closed captions as well.
[00:09:39 -> 00:11:17] Arin Sime: And we'll talk more about use cases for that, and things like closed captions, in there. So I'm looking forward to also learning more from you, Aleks. Thank you. So let's start first with the need for accessibility and accessibility work in application development, and in our company at WebRTC Ventures. We definitely have exposure to this topic in our team, but I think, to be frank, not enough. I mean, there's so much work that still can be done in applications that is not always done, starting with things like Section 508 in U.S. law. This is an area where we've done work on multiple projects for state and federal agencies who are required to follow it to ensure accessibility. It's not necessarily a video-specific accommodation, more around accessibility work for screen readability, descriptive tags, contrast and color choices for low vision, et cetera. We've also done some work with a service called Aira.io, a visual interpreting service with on-demand trained interpreters for higher education, government buildings, and more corporate scenarios that will fund that application. And then we've done work in a number of different scenarios, and this is probably the area we've done the most work in: building video into call centers for sign language interpretation, as well as language translation scenarios where video is helpful. We're doing this now for a client using a pre-switched media server, and we've done it with other clients as well in the past using Vonage and other CPaaS platforms and open source media servers.
[00:11:17 -> 00:11:49] Arin Sime: So that's definitely an interesting aspect of our work in building video applications, where we get to be a part of improving accessibility. So we've had a good amount of exposure, but nonetheless, I think that too often software applications are not built with accessibility in mind. And so Bryan, I'd like to start with you and have you tell us a little bit more about just the need for accessibility, the different types of disabilities that need to be considered in software development in general.
[00:11:50 -> 00:13:35] Bryan Bashin: Absolutely. Well, thanks. It's a good question. I'm a blind guy myself. And so I live by that accessibility sword and sometimes die on it. So, you know, we've gone from the web and apps being novelties, and then amazing things, to where we are 30 years later. Rather than thinking about people and needs, think about this as a public utility. I mean, you don't provide water to a city and say, oh, somebody needs a lot of water or a little water. You just provide water. And so with accessibility, it's the same thing. Of course, there are many different kinds of disabilities. I know particularly about people who are blind and low vision; there are 285 million of those on the planet, almost equivalent to the population of the United States. But there are others. There's a large population of folks who don't really use the current stuff because it's overly complex, or maybe cognitively complex. The so-called silver tsunami of older people who can't use some of the software, or find it non-intuitive, as well. So really we're looking at ways in which we can make this just built in and just work all the time, and not look at it as addressing a narrow need or a legalistic need, but just part of intelligent design, which works on so many computer products now, consumer products, my iPhone particularly. Steve Jobs was really a pioneer in that. And it's just assumed that if you get an Apple product, it's gonna talk.
[00:13:37 -> 00:14:05] Arin Sime: Love it, I love that perspective. I appreciate you sharing that. Thank you, Bryan. I think the most common thing that we see for deaf users is captioning built into videos and that sort of thing. But your work is with 3D avatars, so tell us a little bit more about the unique value that 3D sign language avatars provide over captioning.
[00:14:05 -> 00:15:56] Georg Tschare: Yeah, yes. It's evident that deaf people cannot hear spoken information. What is not so commonly known is that deaf people have very poor access to written language, to text information. This is because of their deafness. Text is an abstract form of the spoken language. If you don't have access to the spoken language, if you never hear it, it's very hard to understand the abstract form, the written form, of that language. And it's very hard for deaf people to learn to read and to learn writing, very hard for them. And deaf people have a very limited vocabulary, so a lot of expressions they find in text information they just don't understand, because they don't know the word. Now, when we talk about deafness, we talk about those people who primarily use sign language as their first language. There are a lot of people who lose their hearing ability when they become older, and they don't have these reading problems, because they were socialized in the text information world; they can read. But people who are born deaf, or lose their hearing ability at a very young age, never get access to text language. And therefore we talk about 70 to 80% of all deaf persons being functionally illiterate. So if you provide them captions or any text information, it's not accessible to them. The only way to make information accessible to deaf people is to provide sign language. And maybe one solution would be to provide a lot of sign language interpreters.
[00:15:56 -> 00:17:06] Georg Tschare: But if you consider the huge amount of text information which is produced in the world every day, you would need millions of interpreters, and they are not available at all. For example, there are nearly 5,000 books published every day. 5,000 books every day. And we talk about more than 100,000 newspapers which are published every day. So if you just make a rough calculation and say, okay, you need about 10 interpreters to translate one newspaper every day, then you need about 1 million interpreters just to translate the newspapers. And you're still far away from translating all the radio stations, the broadcasts, the TV stations, and all the other text information which is newly produced every day. So the only solution to make this whole enormous amount of textual information accessible to the deaf is a technical solution. And the only way to translate in a technical way is avatar systems, because you cannot handle this with humans.
[00:17:07 -> 00:17:33] Arin Sime: That's really interesting. Yeah, I hadn't considered the scope of the challenge, so I really appreciate that explanation of it. So how widespread do you envision 3D sign language avatars becoming? Is there any reason that they would not be used in all video, like live broadcasts as well as recorded broadcasts, eventually? Should this be ubiquitous over time?
[00:17:32 -> 00:19:06] Georg Tschare: So I think in the long run, we will have all information accessible with sign language avatars. But we are not yet at the point, we are not technologically at the point, where we can translate natural language in real time. That's not possible. So there's a lot of human work still needed to translate all this text information. We can do fully automatic translation with standardized texts, like loudspeaker announcements in public transport, which always have the same form of information. So like: dear passengers, at the moment line B4 is interrupted between station A and station B, please move to line U6. It's a standardized sentence with variables in it, and you can change the sentence by just changing the variables, and in this way everything that can happen in the transport system can be translated into sign language. But you cannot do this with the daily news, because you don't know what kind of information will come up the next morning. So at the moment we start at the small level where we can translate fully automatically, but that level gets higher and higher with every year of development, because our systems become better and better all the time.
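To make the idea of a standardized sentence with variable slots concrete, here is a minimal TypeScript sketch of how such a fully automatic announcement path could be wired up. All of the names (SignClip, AnnouncementTemplate, renderAnnouncement) and the clip inventory are hypothetical illustrations, not Sign Time's actual SiMAX architecture.

```typescript
// Hypothetical sketch: assembling a signed announcement from a fixed template.
// Each template slot maps to a pre-produced avatar animation ("sign clip");
// only the variable slots change between announcements.

type SignClip = { id: string; gloss: string }; // one pre-animated sign sequence

type AnnouncementTemplate = {
  name: string;
  slots: Array<{ kind: "fixed"; clip: SignClip } | { kind: "variable"; field: string }>;
};

// Pre-produced clips for the fixed parts of a service-disruption announcement.
const disruptionTemplate: AnnouncementTemplate = {
  name: "line-disruption",
  slots: [
    { kind: "fixed", clip: { id: "c1", gloss: "DEAR PASSENGERS" } },
    { kind: "fixed", clip: { id: "c2", gloss: "LINE" } },
    { kind: "variable", field: "line" },
    { kind: "fixed", clip: { id: "c3", gloss: "INTERRUPTED BETWEEN" } },
    { kind: "variable", field: "fromStation" },
    { kind: "fixed", clip: { id: "c4", gloss: "AND" } },
    { kind: "variable", field: "toStation" },
    { kind: "fixed", clip: { id: "c5", gloss: "PLEASE USE LINE" } },
    { kind: "variable", field: "alternativeLine" },
  ],
};

// Pre-built dictionary of signed proper nouns (station names, line numbers).
const variableClips: Record<string, SignClip> = {
  U4: { id: "v1", gloss: "U4" },
  U6: { id: "v2", gloss: "U6" },
  "Station A": { id: "v3", gloss: "STATION-A" },
  "Station B": { id: "v4", gloss: "STATION-B" },
};

function renderAnnouncement(
  template: AnnouncementTemplate,
  values: Record<string, string>
): SignClip[] {
  return template.slots.map((slot) => {
    if (slot.kind === "fixed") return slot.clip;
    const clip = variableClips[values[slot.field]];
    if (!clip) throw new Error(`No signed clip for value "${values[slot.field]}"`);
    return clip;
  });
}

// Example: the announcement Georg describes, rendered as a sequence of clips
// that the avatar engine would play back in order.
const playlist = renderAnnouncement(disruptionTemplate, {
  line: "U4",
  fromStation: "Station A",
  toStation: "Station B",
  alternativeLine: "U6",
});
console.log(playlist.map((c) => c.gloss).join(" "));
```

The point of this design is that only the variable slots ever change, so every possible announcement stays inside a vocabulary of pre-reviewed sign sequences, which is what makes the fully automatic path feasible.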
[00:19:07 -> 00:20:19] Arin Sime: Interesting, yeah. So this is the scope of the problem, as you said previously: we can't have human sign language interpreters translating every book or every radio broadcast. It's not feasible. It's not scalable. AI can certainly help us with text transcripts of those, but being able to handle all of that at scale, and based on natural language, is an ongoing challenge. So yeah, that's really interesting, Georg, thank you. And so Aleks, let's move to you, because now the ball is really in your court when we start talking about speech-to-text transcription and the work that you do at Wordcab. The most common accessibility implementation that we see is captioning on videos, but it certainly still varies in quality, on live video in particular, because of natural language and because of ums and ahs and all those other things. Can you talk a little bit about the state of speech-to-text captioning on videos?
[00:20:20 -> 00:21:19] Aleks Smechov: Yeah, sure. So most people who've been in online meetings on Zoom or Google Meet, or seen a subtitled YouTube video, have probably seen live transcription in action. There might be some latency issues, sometimes proper nouns or difficult words get chopped up, but for the most part, you can pretty much follow along. The quality of the captions will also get steadily better over the next few years, and the latency issue will get resolved eventually. At least for the hard of hearing and deaf communities, the issue with captions isn't just about quality or speed. It's also that captions are, A, pretty much a second or fourth language for people who've been deaf most, if not all, of their life, and Georg knows much more about this than I do and can speak to it. And B, live captions don't have the nuance that closed captions do, for example, in shows or movies, like describing background noises or the type of laughter coming from a character.
[00:21:23 -> 00:21:45] Arin Sime: So I think this gets to some of what you were demoing in the video with Wordflow, and how transcripts have come a long way but certainly still have imperfections. So can you talk a little bit more about the best practices for editing transcripts on recorded videos, and some of the challenges that you're looking to solve with Wordflow?
[00:21:47 -> 00:22:01] Aleks Smechov: Yeah, so for Wordflow, we're just trying to get to a near-perfect transcript as soon as possible. Get people 90% of the way there so they can scale their efforts in transcribing videos, mostly for offline use cases though.
[00:22:02 -> 00:23:19] Arin Sime: But still nearly, well, still immediately. I mean, I'm thinking, for instance, of how LinkedIn handles this. When I post a recorded video to LinkedIn, it will post it and it goes live immediately, but then there's a way that I can sort of edit the transcript through LinkedIn a little bit. And it's pretty good line by line. Within a few minutes of posting the video, I can go back and make sure that the captioning on it is pretty good and edit it. But if I'm dealing with anything longer than that, that's what I thought was really interesting about the demo that you showed of Wordflow: the ability to search and replace things in there, to just immediately take out things like ums and ahs that I certainly do a lot. So I could really see the power of that, the importance of that, in making something more accessible. Obviously, the more useful we can make the captioning, the more accessible it is. Let's go back to sign language a little bit. As I understand it, there are many variations of sign language. How do you handle those variations? Does that affect the types of use cases that you work with?
[00:23:20 -> 00:25:14] Georg Tschare: Yes, there are really a lot of variations of sign language, especially since in every country there's at least one national sign language. And within these national sign languages, there are a lot of dialects. And the standardization of sign language is not as far along as the standardization of spoken languages. So there are also a lot of dialects in the United States, but there is something like official American English, and this is a standardized language; you can buy a dictionary or whatever, and you find definitions of everything. This standardization does not exist in sign language, but it evolves more and more. And globalization doesn't stop at sign language. So the sign language communities become more and more standardized too, because they communicate much more. And so the most popular dialect of sign language in each country becomes more and more something like an official national sign language, just because it is used. And there's another trend we see: worldwide, American Sign Language, ASL, becomes more and more popular. There are a lot of countries worldwide which use ASL or an ASL-based sign language, especially in Africa. We have a lot of countries that have ASL-based sign languages, but also, if deaf people pick up a sign language as a foreign language, they tend to use American Sign Language and not British Sign Language or French or whatever. They tend to use American Sign Language.
[00:25:14 -> 00:25:57] Georg Tschare: Why? Because on YouTube, on all media, the most popular sign language is American Sign Language. So we expect that in the near future, that means within about 10 years, American Sign Language will become more and more important within worldwide sign language. What we do now is, if we produce for a certain country, we go for the most popular sign language which is used in that country. This could be a dialect, or it could be a form of sign language which is used at a particular university or in a certain region and which is very common and very popular within that country.
[00:25:59 -> 00:27:40] Arin Sime: Yeah, interesting. As you were saying that, it made me think about how, if I'm watching a movie on Netflix, I can sometimes choose which language I want captions in, depending on the movie. You could presumably do the same thing here; another benefit of sign language avatars in content is that you could potentially have multiple options for the avatars to use American Sign Language, French Sign Language, et cetera. So that's another benefit of the potential scalability of doing this with avatars. Yeah, it's really interesting. Bryan, let's go back to Be My Eyes, which is focused on providing assistance in physical-world scenarios like shopping, as we saw Lucy do in the video that we looked at at the beginning, and helping people with inaccessible items like product labels, screens, et cetera. I'm curious what you think about the metaverse and virtual worlds that get a lot of conversation at the tech conferences that I go to, and the potential for incorporating video and audio communications into the metaverse, with all of us having our own avatars in it. What do you imagine is the future of accessibility in a world like that, where communication applications may become very visual again, with people sitting around a virtual conference table, bringing hand gestures and facial expressions into it? How do you anticipate accessibility being worked into that virtual world?
[00:27:41 -> 00:29:57] Bryan Bashin: Arin, it's such a good and dreaded question. The proliferation of 48 megapixel cameras and gigabit fiber and amazing screens, this is kind of unstoppable; because we have all the elements, it's going to happen. But in the blind community and some other communities, we are worried about being ostracized. I mean, take an old technology like a PowerPoint slide. If I'm in a classroom or in a Zoom meeting and somebody shares a PowerPoint, the natural thing is for people just to look at it. And in reality, you have to stop and say, could you please read the points in the PowerPoint, because it's not being spoken? And that's the old technology. Think about the new technology, where everything is just so, screens and whiteboards and sharing this and that. It will be a massive shift from the world of the word to the world of the picture. And what that means for blind people, unless we are intentional about it, is ostracism, social distance, being there but not really being there, not being able to know the cues, as we start having a cultural shift to dialogue that's largely visual. And so these cute things about waves of hands and all these gestures and all of that will not be accessible to us. And that will become the lingua franca of business meetings, of schools, and of social platforms. Now, there are solutions to some of this. I mean, for instance, I was intrigued by the guest just a moment ago: real-time video description of things like, oh, he smiles, she nods her head, things like that. That could help. Spatial audio could help. But I really am worried that as we go to a visual culture, which all this tech is enabling, people who can't see will be left out of this sight-dominated new platform.
[00:29:57 -> 00:31:11] Arin Sime: Yeah, no, I think it's a very legitimate concern, unfortunately, and I think we take too much for granted from our own life experiences as we build these applications. And so it's important to keep that in mind. One thing I was thinking about as you were talking that hopefully would be helpful in making these more accessible, but which does require us to be thoughtful about it and kind of plan for it, is that if, say in a video call like this, you were to know that I'm waving my hands around or I'm smiling, we would need some video-based AI to analyze my video and to provide a cue that Arin is smiling, Arin is waving his hands, right? In theory, in a metaverse scenario, that should actually be easier to do, because we don't need an AI to interpret what the avatars are doing so much; they're essentially coded objects. And so if we're building hooks for accessibility into a metaverse type of scenario, hopefully we can know that programmatically and share it for accessibility more easily, but someone has to think to do it.
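As a rough illustration of the "hooks for accessibility" idea, here is a small TypeScript sketch in which avatar gesture events are surfaced as text cues without any video analysis. The event shape, the CueSink interface, and the ARIA live-region wiring are assumptions made for the sake of the example, not a reference to any particular metaverse platform's API.

```typescript
// Hypothetical sketch: because avatars in a virtual meeting are driven by
// application state rather than raw video, gesture cues can be surfaced
// programmatically (e.g., to a screen reader) without video analysis.

type AvatarGestureEvent = {
  participant: string;
  gesture: "wave" | "smile" | "nod" | "raise-hand" | "thumbs-up";
  timestamp: number;
};

// Anything that can receive a text cue: a screen reader live region,
// a braille display bridge, a caption track, etc.
interface CueSink {
  announce(text: string): void;
}

const descriptions: Record<AvatarGestureEvent["gesture"], string> = {
  wave: "is waving",
  smile: "is smiling",
  nod: "is nodding",
  "raise-hand": "raised their hand",
  "thumbs-up": "gave a thumbs-up",
};

function announceGesture(event: AvatarGestureEvent, sink: CueSink): void {
  sink.announce(`${event.participant} ${descriptions[event.gesture]}`);
}

// Example: wiring the cue into an ARIA live region in a web-based client.
const liveRegionSink: CueSink = {
  announce(text) {
    const region = document.getElementById("a11y-cues");
    if (region) region.textContent = text; // screen readers pick up the change
  },
};

announceGesture(
  { participant: "Arin", gesture: "wave", timestamp: Date.now() },
  liveRegionSink
);
```

The design choice worth noting is that the cue is generated from structured state the application already has, which is exactly the advantage Arin describes over analyzing a camera feed.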
[00:31:12 -> 00:31:37] Bryan Bashin: Exactly. Like, for instance, in a Zoom meeting now, if somebody screen shares, you know that screen is inaccessible, so I can't read any of those screens. And so if you sort of multiply that experience daily with every work group and every media interview and everything that I do, think how that feels after a while. It's disabling technology, and unnecessarily so.
[00:31:38 -> 00:32:17] Arin Sime: Yeah, no, it's very valuable; I appreciate you sharing it, Bryan. I wanna talk about one other thing in Be My Eyes real quick, and then we're gonna move on to more discussion of AI and accessibility. But you had mentioned something to me prior to us recording that I thought was really interesting, about the need for family and friends groups in Be My Eyes. I think you said something along the lines of this being particularly important in the global South and in cultures where there are different aspects of how they view accessibility or talking to strangers. I'd love to hear you talk a little bit more about that.
[00:32:18 -> 00:34:19] Bryan Bashin: Absolutely, well, you know, we have 7 million volunteers on the platform with Be My Eyes. And maybe it's American culture, but I think it's a great adventure to get a random person to help me read a setting on my washing machine or whatever it is. But I recognize that's a cultural characteristic of us. In so many cultures, you don't talk to strangers. You talk to people that you know. And the idea of, you know, encountering somebody in the street, or the digital street, is a wary kind of thing. So we've just launched, in December, the ability for people to form their own trusted circles of folks. And you can create up to five different trusted circles. So say, for instance, a blind person has 20 friends who are really good with fashion. Okay, so you create your fashionista circle; when you're getting ready in the morning, you just call that circle, the first one who's available picks up, and then you choose an outfit, that kind of thing. Similarly, in the global South, you get somebody with your language, with your culture, whether it's your friend or your family member, and that allows a larger sort of world penetration. We have almost 600,000 blind users now with Be My Eyes, but we think the market is much larger than that. It's a free app. This isn't about merchandising, but we need to be culturally sensitive, and this is one of our responses to it. Yeah, no, it's a really interesting example to me of the importance of keeping culture in mind with accessibility, and similar to what Georg was saying about the differences in sign language across so many countries. And so we have to think about not only accessibility, but accessibility combined with cultural sensitivity. And it also reminds me of conversations I've had with some people in the AI space, talking about how in certain scenarios,
[00:34:19 -> 00:34:54] Arin Sime: Like in maybe a chat bot around a healthcare scenario, you may actually be, or excuse me, this was a debt collection scenario. It was really interesting. This person talked to me about how in a debt collection scenario, you might be more comfortable speaking with an AI than a human because you feel less judged by the AI if you’re dealing with a debt that you owe still. And so some interesting cultural aspects there too, where actually AI provides some benefits to the conversation.
[00:34:54 -> 00:35:10] Bryan Bashin: Yeah, we’ve heard early on when we turned on AI, there are a lot of uses for AI in the kitchen with blind folks, but some folks thought maybe their kitchen was a little messy and they didn’t want a stranger looking and judging them, but AI just tells you what it sees.
[00:35:10 -> 00:35:33] Arin Sime: Well, that would be the case with me. And I think AI would tell me I have a mess in the kitchen, but no, I like it. So speaking of AI, let's talk a little bit more about that specifically. And Aleks, I wanna go to you first and have you talk a little more about the role that AI can play in improving accessibility of voice conversations.
[00:35:34 -> 00:37:16] Aleks Smechov: Yeah, sure. And by the way, Bryan, very fascinating points about culture. Just on a personal note, my wife is Vietnamese, and Vietnamese people are very tight knit. They rely on family members a lot. They live in the same household. So it totally makes sense, versus Americans, who are a lot more open to talking to strangers and having adventures. But just to go back to your question, we've actually tackled the closed caption approach to live transcription internally, which is more suited for hard of hearing folks. So for example, for one project, we built an audio classifier that's uniquely suited for phone conversations. It works alongside a speech-to-text model and is able to detect when a phone is ringing, when someone is laughing, if there's background music or background noise, if there's a dial tone, and of course, if there's direct speech. The first thing we do when we start a project is look at what the open source community has created. And surprisingly, there were no audio classification models for phone conversations specifically, so we just trained our own. Some examples where this would be useful: if you're calling a customer support rep and are using a phone with captions built in, it will be hard to tell when the phone's ringing or has a busy signal, or if there's hold music. And if you're just talking with a friend, phrases like "Are you really going to wear that?" and "You're hilarious" take on totally different tones when not accompanied by laughter, or laughter captions. In these cases, audio classification would add much-needed context.
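A minimal sketch of how such non-speech labels could be merged with ASR output into a single caption stream is shown below. The segment and event types and the merge function are hypothetical illustrations of the approach Aleks describes, not Wordcab's actual API.

```typescript
// Hypothetical sketch: merging ASR caption segments with audio-event labels
// from a classifier into one chronologically ordered caption stream, so that
// non-speech context ("[phone ringing]", "[laughter]") appears inline.

type SpeechSegment = { start: number; end: number; text: string };

type AudioEvent = {
  start: number;
  end: number;
  label: "ringing" | "laughter" | "background_music" | "dial_tone" | "background_noise";
};

type Caption = { start: number; text: string };

const eventText: Record<AudioEvent["label"], string> = {
  ringing: "[phone ringing]",
  laughter: "[laughter]",
  background_music: "[background music]",
  dial_tone: "[dial tone]",
  background_noise: "[background noise]",
};

function mergeCaptions(speech: SpeechSegment[], events: AudioEvent[]): Caption[] {
  const captions: Caption[] = [
    ...speech.map((s) => ({ start: s.start, text: s.text })),
    ...events.map((e) => ({ start: e.start, text: eventText[e.label] })),
  ];
  // Order everything by start time so events land between the right utterances.
  return captions.sort((a, b) => a.start - b.start);
}

// Example: a short support-call snippet.
const merged = mergeCaptions(
  [{ start: 6.2, end: 8.0, text: "Thanks for calling, how can I help?" }],
  [
    { start: 0.0, end: 5.5, label: "ringing" },
    { start: 8.1, end: 9.0, label: "laughter" },
  ]
);
console.log(merged.map((c) => c.text).join("\n"));
// [phone ringing]
// Thanks for calling, how can I help?
// [laughter]
```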
[00:37:18 -> 00:37:48] Arin Sime: That's really interesting, and a tremendous value of AI. Bryan, I wanna go back to you about AI. So Be My Eyes is a combination of human volunteers and video chat, and then also scenarios like we showed in the video at the beginning, where it's AI-driven without a human volunteer involved. So just tell us a little bit more about that AI feature for things like photo descriptions.
[00:37:49 -> 00:39:00] Bryan Bashin: Sure. Well, it turns out that blind people, just like everybody else, are on social networks, wanna share photos, get people's photos, and want them described. And now with AI, you could just take any photo in your camera roll and have it described. Or if you're an instructor or university lecturer, you can have all your visuals described. It turns out that tagging all the photographs in, say, a PowerPoint presentation is really a chore for academics. Do I have the right language? Did I get the things? You can have an AI alt-tag them. And that might be a hundred photos in an hour-long lecture. It's a great convenience. So we're finding that access and the ability to comment and share, in the few months since we've switched AI on, there's been a profusion of blind people who post to social networks using the AI photo description so that their other blind friends can see what they're talking about. So it's a whole meta level of description there.
[00:39:01 -> 00:39:17] Arin Sime: Yeah, yeah. Even just looking through my photos on my iPhone, it's interesting how well AI has started to categorize things. I can search for photos that I took with something in them, certainly location-tagged, but also just things about the photo itself.
[00:39:18 -> 00:40:01] Bryan Bashin: But I will say that AI, of course, in its infancy, doesn't always get things right. So I'll give you a little preview of what's coming with Be My Eyes. We have a double-check feature where we can submit the AI photo description to a bunch of volunteers who will say yes or no quickly, within just seconds, so that the user can get a reliability index on it, or even ask what's wrong with things. Because we've all seen the hallucinations, and sometimes you want to really be sure of something. So we have the ability to add humans in the loop. That's coming in the next few weeks.
[00:40:00 -> 00:40:41] Arin Sime: Georg, let's go back to the avatars. One thing around AI that's definitely interesting to us in our work is chatbots: chatbots prior to a healthcare visit for pre-visit screening, or chatbots around customer service. It's an important part of the work that we do with contact centers and things like that. Your avatars are primarily intended to be fed something like a text transcript, right? Is there any difference functionally? I mean, do you see them being used in the future also with AI chatbots for customer service, for example?
[00:40:42 -> 00:42:38] Georg Tschare: Yes, sure. Indeed, we do. And for our technology, it's not relevant whether the information comes from any kind of text input or from a chatbot. The chatbot also uses text and provides us the text, and we translate the text into sign language. But the big advantage of the chatbots and all these new AI technologies is that we can use these LLM models to translate non-standardized text into standardized text. Because standardized text, as I already described with our system for loudspeaker announcements, can be translated fully automatically into sign language much better. And therefore, we look at technologies like Aleks described to us, where the technology not only translates spoken text into captions, but also describes the other sounds you hear around. This is also very valuable information for us, because if I know, okay, there's a scene, and in the scene two persons are talking in a dialogue, and there is also a sound which is important to describing the scene, and we can get this information in a standardized way, then we can translate it automatically into sign language. For example, we just delivered a proof of concept last December, just a few weeks ago, where we had a fully automatic translation of, not a translation, it was a description of a soccer game in sign language. And there's a lot of AI in the background. So the AI produced standardized information.
[00:42:38 -> 00:43:22] Georg Tschare: For example: the player Maya runs to the goal and then passes to another player, and so on and so on. All this information can be captured by engines and provided as standardized text. And if we get this standardized text from the AI, then we can fully automatically translate it into sign language. So this is very, very important. This is very new technology; it has only worked for, I think, a few years. And we try to adapt all this technology to make it much easier, better, faster, and cheaper to produce the sign language.
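As a hedged sketch of that normalization step, the snippet below turns free-form commentary into a standardized event record that a template-based signing pipeline (like the announcement sketch earlier) could consume. The callLlm function is a stand-in stub, and the MatchEvent shape is an assumption for illustration only; none of this reflects Sign Time's real system.

```typescript
// Hypothetical sketch: using an LLM to normalize free-form commentary into a
// standardized event record that a template-based avatar pipeline could then
// translate automatically by filling variable slots.

type MatchEvent = {
  kind: "pass" | "shot" | "goal" | "foul";
  player: string;
  target?: string; // receiving player, for passes
};

// Stand-in for a real LLM call; returns a canned response for illustration.
async function callLlm(prompt: string): Promise<string> {
  void prompt;
  return '{"kind":"pass","player":"Maya","target":"Keller"}';
}

async function normalizeCommentary(freeText: string): Promise<MatchEvent> {
  const prompt = [
    "Convert the following soccer commentary into a JSON object with the fields",
    '"kind" (one of pass, shot, goal, foul), "player", and optionally "target".',
    "Respond with JSON only.",
    `Commentary: ${freeText}`,
  ].join("\n");

  const raw = await callLlm(prompt);
  return JSON.parse(raw) as MatchEvent; // real code would validate the shape
}

// Example: the standardized record can then select a signing template and
// fill its variable slots, much like the transport announcement sketch above.
normalizeCommentary("Maya runs toward the goal and passes to Keller").then((event) => {
  console.log(event); // e.g. { kind: "pass", player: "Maya", target: "Keller" }
});
```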
[00:43:23 -> 00:44:49] Arin Sime: That's interesting. I love this applying to other things, like sports, too. Obviously this is about having accessibility in every aspect of life, not just getting around on the tram or anything like that, right? And you remind me of something very touching. I'm a big soccer fan, and there was a very touching video that went around a while back of a blind gentleman at an English soccer match who goes to every match and has the same friend with him every time, who's relaying, basically describing to him, what they're witnessing at the game. A very sentimental video, but, you know, it also drives home the importance of accessibility in all aspects, in all situations in life. So I've really enjoyed this conversation. We're a little bit over the time I told you; I hope you have just a few more minutes to share with our viewers. I have just one or two more questions here for you, and it's been a wonderful conversation. So, on the technical architecture side, and then we'll just have a couple of closing remarks from each of you before wrapping up here. Bryan, one thing I wanted to ask you about: I know that Be My Eyes is a mobile app available on iOS and Android. I'm curious about the role for wearable devices in the future. Do you see that being an important part of Be My Eyes or accessibility in general?
[00:44:50 -> 00:45:47] Bryan Bashin: Hugely, and it's coming, and it's coming very soon, on several different platforms. Imagine if you have this great AI or human help there, but you're holding something like a potato, when you need both hands to drill a hole in the wall or use a bubble level or all these things. So it's hands-free that's really important to us as we go through life. And there are some mass market products that are just being released now, and we are partnering, or in the process of partnering, with several of them. This will be the solution for many, many things. I mean, imagine I've got my iPhone and I want a volunteer to look at it. You know, you can't do that, but with a wearable, you can. And so it is just one more iteration of what's gonna give us more power and more elegance in fixing things around our world.
[00:45:48 -> 00:46:26] Arin Sime: Yeah, I look forward to seeing the progression of that technology. It's something we experimented with many years ago when Google Glass first came out, and the technology just wasn't quite there at the time to try to do a live WebRTC-based video chat over Google Glass. You practically burned the battery pack off of your belt for the power consumption that was needed. So as the technology continues to improve, hopefully all of these use cases become more accessible. Georg, on the technical side, what are some of the unique technical challenges that you've run into with your sign language avatars?
[00:46:28 -> 00:48:17] Georg Tschare: Yeah, I think the biggest challenges are the expectations of our customers, because nearly every day we get requests from somewhere in the world, and customers ask for translation systems, especially for live translation of events or whatever. And they expect that we deliver something like Google Translate for sign language. And we are a bit away, we are far away, from that level of technology. There are a lot of reasons. One of the reasons is that we don't have this amount of digital content in sign language to train AI engines and things like that. Google Translate and other translation engines use millions of pages and books in different languages, and in this way you can train a system to translate from one language into the other. In sign language, we don't have this content. So we have to go different ways. We have to go semantic ways of translation instead of statistical ways. So this is the biggest challenge we have. Another challenge is that translating text into sign language has to be compared to translating a written language into a spoken language, because sign language is comparable to spoken language. There's a lot of facial expression, a lot of body movement, and a lot of emotional information which comes with the sign language, which you do not find in the text version of the same information. So this is another challenge we have, because if we don't care about this, the sign language becomes very boring and nobody wants to look at such sign language for a long time.
[00:48:17 -> 00:48:58] Georg Tschare: So we have to add this additional information, and at the moment it's very hard to do this automatically. But with technologies like the ones Aleks told us about, because his software also describes not just what a person is saying, but even whether they're smiling or not, and this is very important for making sign language information out of it. And so we are very optimistic that we can reach our next goals within the next one or two years, where we can use this very valuable information as input for our sign language translation.
[00:49:00 -> 00:49:50] Arin Sime: Interesting. And yeah, Aleks, speaking of those cues, and you spoke to this a little bit earlier, but I'm curious to know, when you're tuning models, how far away are we from a transcript really detecting the emotion of the things that someone is saying in a conversation? Beyond just, I see some sentiment analysis in contact center tools now, for example, but I haven't seen a transcript that said something like "sarcastic laugh," which is maybe my most common type of laugh, versus "belly laugh." So what are some of the technical challenges with capturing things like that, as opposed to a dial tone, for instance, which might be a more standard sound?
[00:49:51 -> 00:50:51] Aleks Smechov: Right, great question. Well, if you have enough training data, stuff like detecting a sarcastic laugh versus a belly laugh isn't much of a challenge. The challenge is much more gathering that data in the first place, because you're gonna need to gather examples of all types of belly laughs and sarcastic laughs from all types of voices. So you're gonna need variations on each laugh if you wanna get really accurate. You'll also probably need to analyze the generated captions in that moment to figure out the context as well, right? So not just the audio, but the text around what the person is saying. So that would be another factor in deciding the type of laugh, whether it's jovial or sarcastic. And on top of that, you'll need to distinguish between a speaker's laugh and, say, their family member right beside them watching Netflix. So in that particular case, if you see a caption that says background TV and background voices with laughter, that's pretty helpful.
[00:50:55 -> 00:51:52] Arin Sime: That's interesting. Yeah, the context, I mean, this is so much more complicated than just matching audio waveforms to each other, of what a laugh sounds like, when you have to try and capture that full context. So, well, thank you all very much. This has really, really been an interesting conversation. You've definitely widened my world perspective a lot through this, so I thank each of you for that. And for those of you watching or listening to this episode of WebRTC Live, it's January 2024 as we're recording this, and if you're still looking for a New Year's resolution, adding some more accessibility into your application would be a good one. And I'll take that as a resolution myself as well. I'd like to ask each of you to just share quickly with our viewers how they can get more information about your companies and the work that you do. And I'll start with you, Bryan.
[00:51:53 -> 00:52:14] Bryan Bashin: Sure, our website, www.bemyeyes.com. And on both Google and the Apple side, in the app stores, you can find the app. You can volunteer, or if you know a blind person who can benefit, they can sign up as well and start using the app immediately, free of charge.
[00:52:15 -> 00:52:16] Arin Sime: Thank you, Bryan. Georg?
[00:52:18 -> 00:52:42] Georg Tschare: Yeah, the best way is to visit our website, www.simax.media, or if you want detailed information about a special use case, the best is to contact me directly at my email address. You'll find this also on our website, or just write to office at signtime.media.
[00:52:44 -> 00:52:45] Arin Sime: And Aleks?
[00:52:46 -> 00:53:08] Aleks Smechov: Sure, if you want to find out more about our work and our company, you can visit wordcab.com. And if you'd like to chat about anything specific, audio classification, transcript editing, feel free to email me personally. My email is on the website, or I can tell you right now: aleks at wordcab.com. That's Aleks, with A-L-E-K-S, at wordcab.com.
[00:53:10 -> 00:53:16] Arin Sime: Thank you, Aleks. Thank you, Georg. Thank you, Bryan. I really appreciate all of you taking the time today to share your expertise with us.
[00:53:18 -> 00:53:19] Aleks Smechov: Pleasure. Thank you.
[00:53:21 -> 00:55:01] Arin Sime: All right, and thanks to all of our guests for an enlightening discussion. As always, this video will be available on the WebRTC Ventures YouTube channel, as well as on our blog at webrtc.ventures. This Friday, we have a special presentation on migrating away from Twilio Video, which will be shut down in December of 2024, as you may know already. So we'll be talking about that this Friday, January 12th, at 12 p.m. Eastern, and you can join us for that. Our team will cover open source and CPaaS alternatives to consider, plus other architectural and business decisions to consider as you plan your migration away from Twilio Video before December of 2024. So don't make any architectural decisions until you see this special webinar. You can register now at webrtc.ventures slash Twilio dash video. Look forward to seeing you there. Our next episode of WebRTC Live will be on Wednesday, February 21st, with Dan Nordale from Symbl.ai, so we'll be talking more about AI in communication applications. To find out more about any of our upcoming episodes, follow us on social media at WebRTC Ventures on Twitch, LinkedIn, and YouTube, and join our email list at webrtc.ventures. So thanks, everyone, for joining us today. Let's make it live. Thanks for joining us for WebRTC Live. Visit our website at webrtc.ventures to learn more about our custom design and development services and to learn more about upcoming episodes of WebRTC Live.