In the previous posts of our Polybot.ai translator bot (“Polly”) series, we looked at AI + WebRTC product development, brand creation and UI, and also how to successfully build prompts for interacting with the Large Language Model (LLM).
In this final post, we take a look at the technical steps for integrating all of Polly’s pieces together using the Web Speech API and GPT-3, one of the models underlying OpenAI‘s ChatGPT, in order to add Generative AI voicebot capabilities to your web application.
Prerequisites for Building our Voicebot
- An OpenAI API Key (account required)
- A backend application that manages authentication for the client. My colleague Hamza Nasir shared an example of such a backend when explaining how to integrate WebRTC applications with GPT-3.
Keep in mind that this example is not suitable for production unless you add proper authentication, for instance by having the client fetch credentials from your backend instead of embedding the OpenAI key, as sketched below.
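As a rough illustration, the client could request a credential from your own backend at startup and store it in the openAISecret variable used later in this post. This is only a minimal sketch: the /api/openai-token route and its response shape are hypothetical placeholders for whatever your backend actually exposes.
// Hypothetical helper: fetch the OpenAI credential from your own backend
// instead of hard-coding it in the client bundle.
let openAISecret;

async function fetchOpenAISecret() {
  // '/api/openai-token' is an assumed route; adapt it to your backend
  const res = await fetch('/api/openai-token', { credentials: 'include' });
  if (!res.ok) {
    throw new Error(`Failed to get OpenAI credentials: ${res.status}`);
  }
  const { token } = await res.json();
  openAISecret = token;
}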
Step 1: Transform Speech to Text Using Web Speech API’s SpeechRecognition
The Web Speech API allows you to add voice capabilities to web applications. Its two interfaces, SpeechRecognition and SpeechSynthesis, provide basic Speech-to-Text (STT) and Text-to-Speech (TTS) functions, respectively. We take advantage of this to get the transcription of the user’s audio stream.
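Browser support for these interfaces varies, and speech recognition in particular is only available in some browsers, so it can be worth feature-detecting before wiring anything up. A minimal sketch:
// optional: check for Web Speech API support before wiring up the bot
const hasSTT =
  'SpeechRecognition' in window || 'webkitSpeechRecognition' in window;
const hasTTS = 'speechSynthesis' in window;
if (!hasSTT || !hasTTS) {
  console.warn('Web Speech API is not fully supported in this browser');
}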
First, we get access to local media devices using the getUserMedia method from WebRTC’s Media Capture and Streams API. It’s important to set only the video constraint to true, because here we only need the video from getUserMedia; audio will be handled through the Web Speech API.
Then, we add the video stream to the UI and initialize a SpeechRecognition instance. We set its interimResults and continuous properties to true in order to get transcriptions as they are being generated.
// get the user's video stream
navigator.mediaDevices
  .getUserMedia({
    audio: false,
    video: true,
  })
  .then((stream) => {
    // add the video to the UI
    const localVideo = document.createElement('video');
    localVideo.id = 'localVideo';
    localVideo.autoplay = true;
    localVideo.srcObject = stream;
    localVideo.style.cssText = 'width: 100%; border-radius: 12px';
    publisherDiv.appendChild(localVideo);

    // initialize the SpeechRecognition instance
    // (Chrome exposes it under the webkit prefix)
    const SpeechRecognitionCtor =
      window.SpeechRecognition || window.webkitSpeechRecognition;
    recognition = new SpeechRecognitionCtor();
    recognition.interimResults = true;
    recognition.continuous = true;

    // set an event listener for speech recognition results
    recognition.onresult = (e) => {
      // more code will be added here to manage transcriptions
    };
  });
In order for this to work, we also need to tell the SpeechRecognition service to start listening for incoming audio. To do so, set the lang property to the language you will be translating from and call the start method, as follows:
// a function that starts listening for incoming audio
async function startCaptions(language, translator_language) {
  try {
    speakSythLange = translator_language;
    recognition.lang = language;
    recognition.start();
    captions = true;
  } catch (error) {
    handleError(error);
  }
}
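As a quick usage sketch, you could call startCaptions once the user has picked their languages. The BCP 47 codes below are purely illustrative; replace them with whatever your UI lets the user select.
// Hypothetical usage: translate from English (US) to Spanish (Spain)
startCaptions('en-US', 'es-ES');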
Step 2: Send Captions to GPT-3
The next step is to send the captions to the GPT-3 Large Language Model in order to get their translations.
To do so, let’s revisit the onresult event handler we defined earlier and check whether we have finished receiving transcriptions for a sentence; if so, we start the AI generation function. We do this by checking the isFinal attribute of the SpeechRecognitionResult object like this:
recognition.onresult = (e) => {
  // we get the SpeechRecognitionResult object
  const result = e.results[e.results.length - 1];
  // we get the transcript from the object
  const transcript = result[0].transcript;
  // we check if we finished receiving transcriptions
  if (result.isFinal) {
    // if so, let's call the AI generation function
    startAiGenerator(transcript);
  }
};
Now let’s define our startAiGenerator function. In this function we send the prompt you built in the previous post and the transcript to GPT-3, as system and user messages respectively. Then, we take the response from the model and convert it into something utterable by SpeechSynthesis. Let’s focus on the former first.
We start by defining a messages array containing your prompt as a system message, and then add a new user message containing the transcription. Next, we build the request to OpenAI by passing the array, along with other LLM-related attributes such as temperature and model. One attribute to pay attention to is stream, which tells the API to send the response back as it is being generated. This will be important later.
// define an array containing the prompt as a system message
const messages = [
  {
    role: 'system',
    content: 'Your Prompt here',
  },
];

async function startAiGenerator(message) {
  // build a new user message
  const userMessage = {
    role: 'user',
    content: message,
  };
  // build the GPT-3 request
  const reqBody = {
    messages: [...messages, userMessage],
    temperature: 1,
    max_tokens: 256,
    top_p: 1,
    frequency_penalty: 0,
    presence_penalty: 0,
    model: 'gpt-3.5-turbo',
    // tell the model to stream back the response as it is generated
    stream: true,
  };
  // send the request to the OpenAI Chat Completions REST API
  try {
    const response = await fetch(
      'https://api.openai.com/v1/chat/completions',
      {
        headers: {
          Authorization: `Bearer ${openAISecret}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(reqBody),
        method: 'POST',
        // abortController is assumed to be created elsewhere in the app
        signal: abortController.signal,
      }
    );
  } catch (e) {
    // handle error
  }
}
Now let’s take a look at the second part of the function, in which we handle the response from the LLM. As mentioned before, the response is streamed to the client as it is generated, so we need to use a reader and a decoder in order to parse the translation.
Then, we use a while loop to go through the streamed text, concatenating it into an utterableText variable until we reach a separator such as a period, a comma, a colon, or an exclamation or question mark. When that happens, we call the speakText function, which utters the text to the user and displays it on the screen. We will look at that function later.
try {
  // previous code…

  // we create a variable to concatenate the text from the response
  let utterableText = '';
  // define a reader and decoder for the stream
  const reader = response.body.getReader();
  const decoder = new TextDecoder('utf-8');
  // use a while loop to go through the response
  while (true) {
    const chunk = await reader.read();
    const { done, value } = chunk;
    // when streaming stops we cancel the loop
    if (done) {
      break;
    }
    // we parse the translation from the LLM’s response
    const decodedChunk = decoder.decode(value);
    const lines = decodedChunk.split('\n');
    const parsedLines = lines
      .map((l) => l.replace(/^data: /, '').trim())
      .filter((l) => l !== '' && l !== '[DONE]')
      .map((l) => JSON.parse(l));
    for (const line of parsedLines) {
      const textChunk = line?.choices[0]?.delta?.content;
      if (textChunk) {
        utterableText += textChunk;
        // when reaching a separator we pass the text to SpeechSynthesis
        // and clear the temporary variable
        if (textChunk.match(/[.!?:,]$/)) {
          speakText(utterableText);
          utterableText = '';
        }
      }
    }
  }
} catch (e) {
  // handle error, as shown above
}
Step 3: Display the Translation and Text-to-Speech
The final step is to read the translation aloud and display it on the screen. To do so, let’s take a look at the speakText function we called earlier when parsing the response from the model.
We also create a second function, displayCaptions, that simply creates a new DOM element for displaying the text. We include it below for reference, but you will likely want to implement it based on your own application’s UI logic.
In the speakText function, the first thing we do is create a new SpeechSynthesisUtterance instance. Then, we define its voice and lang properties, and also set some event listeners in case there is something we want to do when the engine starts or stops speaking.
Note that we set lang to a variable called speakSythLange, which in our case contains the language that the user selected to translate to, in BCP 47 format. Again, be sure to adapt this to your application’s logic.
Finally, we pass this new object to the browser’s SpeechSynthesis interface to start the utterance.
// A function that displays the translation in the UI
function displayCaptions(captionText, className, container = document) {
  // get the UI element where the message is added from the container
  const [subscriberWidget] = container.getElementsByClassName(className);
  // create a new DOM element for the message
  const captionBox = document.createElement('div');
  captionBox.classList.add('caption-box');
  captionBox.textContent = captionText;
  // add the message to the UI
  subscriberWidget.appendChild(captionBox);
}

function speakText(text) {
  // create a SpeechSynthesisUtterance object
  const utterThis = new SpeechSynthesisUtterance(text);
  // configure the utterance object
  // (voices is assumed to be populated from speechSynthesis.getVoices())
  utterThis.voice = voices.find((v) => v.name.includes('Samantha'));
  // set the language you're translating to
  utterThis.lang = speakSythLange;
  // set event listeners in case you want to do something when
  // starting/stopping utterances
  utterThis.onstart = () => {
    // do something when the utterance starts here
  };
  utterThis.onend = function () {
    // do something when the utterance stops here
  };
  // display and utter the translation
  displayCaptions(text, 'ai-assistant-captions');
  window.speechSynthesis.speak(utterThis);
}
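The voices list used in speakText has to be populated from the browser’s available voices. A minimal sketch, keeping in mind that in some browsers getVoices() returns an empty array until the voiceschanged event fires, could look like this:
// populate the voices list used by speakText; some browsers report an
// empty list until the voiceschanged event fires
let voices = window.speechSynthesis.getVoices();
window.speechSynthesis.onvoiceschanged = () => {
  voices = window.speechSynthesis.getVoices();
};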
Unleashing the power of Generative AI in your real-time communication application
By following these three simple steps, you can bring the power of AI to your real-time communication application and take that communication to the next level by transcending language boundaries.
It all starts with transcribing the user’s audio using the Web Speech API’s SpeechRecognition interface. Then you take those captions and pass them as messages, along with a well-crafted prompt, to OpenAI’s GPT-3 LLM to get the translations. Finally, you take the output of the LLM and display it on the screen while SpeechSynthesis reads it aloud.
If you’re looking into unleashing the power of Gen AI in your real-time communication application, we have you covered! Contact us and let’s explore the possibilities. Let’s make it live!
Posts in this series:
- AI + WebRTC Product Development: A Blueprint for Success
- Developing a Brand Strategy and Identity for an AI-Powered WebRTC Application
- Prompt Engineering for an AI Translator Bot
- How to Build a Translator Bot Using Web Speech API and GPT-3