How to Build a Translator Bot Using Web Speech API and GPT-3

In the previous posts of our Polybot.ai translator bot (“Polly”) series, we looked at AI + WebRTC product development, brand creation and UI, and how to build effective prompts for interacting with the Large Language Model (LLM).

In this final post, we will take a look at the technical steps for integrating all of Polly’s pieces using the Web Speech API and GPT-3, one of the models underlying OpenAI’s ChatGPT, in order to add Generative AI voicebot capabilities to your web application.

Prerequisites for Building our Voicebot

  1. An OpenAI API Key (account required)
  2. A backend application that manages authentication for the client. My colleague Hamza Nasir shared an example of such a backend when explaining how to integrate WebRTC applications with GPT-3.

Keep in mind that this example is not suitable for production unless you add proper authentication.
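If you just want to experiment, a minimal sketch of such a backend could look like the following. The /api/openai-token route and the Express setup are assumptions made for this example; in a real application you would authenticate the client and avoid handing out a long-lived key.

// minimal Express backend sketch that hands the client an OpenAI credential
// NOTE: the /api/openai-token route is hypothetical and there is no
//   authentication here, so do not use this as-is in production
const express = require('express');
const app = express();

app.get('/api/openai-token', (req, res) => {
  // TODO: authenticate the client before returning anything
  res.json({ token: process.env.OPENAI_API_KEY });
});

app.listen(3000, () => console.log('Auth backend listening on port 3000'));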

Step 1: Transform Speech-to-Text using Web Speech API’s SpeechRecognition

The Web Speech API allows you to add voice capabilities to web applications. Its two interfaces, SpeechRecognition and SpeechSynthesis, provide basic speech-to-text (STT) and text-to-speech (TTS) functionality, respectively. We take advantage of the former to get the transcription of the user’s audio stream.
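Browser support for these interfaces varies; Chrome, for example, exposes SpeechRecognition under a webkit prefix. A quick feature check before doing anything else is a cheap safeguard (this snippet is just a sketch, not part of Polly’s original code):

// check that the browser exposes both Web Speech API interfaces
const SpeechRecognitionImpl =
  window.SpeechRecognition || window.webkitSpeechRecognition;

if (!SpeechRecognitionImpl || !('speechSynthesis' in window)) {
  console.warn('Web Speech API is not fully supported in this browser');
}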

First, we get access to local media devices using the getUserMedia method from WebRTC’s Media Capture and Streams API. It’s important to set only the video constraint to true; we don’t need the audio track here because audio will be managed through the Web Speech API.

Then, we add the video stream to the UI and initialize the SpeechRecognition instance. We set its interimResults and continuous properties to true in order to get transcriptions as they are being generated.

// get the user's video stream
navigator.mediaDevices
    .getUserMedia({
      audio: false,
      video: true,
    })
    .then((stream) => {
      // add the video to the UI
      const localVideo = document.createElement('video');
      localVideo.id = 'localVideo';
      localVideo.autoplay = true;
      localVideo.srcObject = stream;
      localVideo.style = 'width: 100%; border-radius: 12px';
      publisherDiv.appendChild(localVideo);

      // initialize SpeechRecognition instance object
      recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
      recognition.interimResults = true;
      recognition.continuous = true;

      // set an event listener for speech recognition
      recognition.onresult = (e) => {
        // more code will be added here to manage transcriptions
      }
    });

In order for this to work, we also need to tell the SpeechRecognition service to start listening for incoming audio. To do so, set the lang property to the language you will be translating from and call the start method, as follows:

// a function to start listening for incoming audio
async function startCaptions(language, translator_language) {
  try {
    speakSythLange = translator_language;

    recognition.lang = language;
    recognition.start();

    captions = true;
  } catch (error) {
    handleError(error);
  }
}
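For example, assuming the user speaks English and wants Polly to translate into Spanish, you could call the function with the corresponding BCP 47 language tags (the specific languages here are just an illustration):

// start listening for English speech and translate into Spanish
startCaptions('en-US', 'es-ES');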

Step 2: Send Captions to GPT-3

The next step is to send the captions to the GPT-3 Large Language Model in order to get their translations.

To do so, let’s revisit our onresult event handler and check whether we have finished receiving transcriptions for a sentence; if so, we call the AI generation function. We do this by checking the isFinal attribute of the SpeechRecognitionResult object, like this:

recognition.onresult = (e) => {
  // we get the SpeechRecognitionResult object
  const result = e.results[e.results.length - 1];
  // we get the transcript from the object
  const transcript = result[0].transcript;

  // we check if we finished receiving transcriptions
  if (result.isFinal) {
    // if so, let's call AI generation function
    startAiGenerator(transcript);
  }
};

Now let’s define our startAiGenerator function. In this function we send the prompt you built in the previous post and the transcript to GPT-3, as system and user messages respectively. Then, we take the response from the model and convert it into something utterable by SpeechSynthesis. Let’s focus on the former first.

We start by defining a messages array containing your prompt as a system message, and adding a new user message containing the transcription. Next, we build the request to OpenAI by passing the array along with other LLM-related attributes such as temperature and model. One attribute to pay attention to is stream, which tells the model to send the response back as it is being generated. This will be important later.

// define an array containing the prompt as system message
const messages = [
  {
    role: 'system',
    content: "Your Prompt here"
  },
];

async function startAiGenerator(message) {
  // build a new user message
  const userMessage = {
    role: 'user',
    content: message,
  };

  // build the GPT-3 request
  const reqBody = {
    messages: [...messages, userMessage],
    temperature: 1,
    max_tokens: 256,
    top_p: 1,
    frequency_penalty: 0,
    presence_penalty: 0,
    model: 'gpt-3.5-turbo',
    // tell the model to stream back the response as it is generated
    stream: true,
  };

  // send the request to GPT-3 REST API
  try {
    const response = await fetch(
      'https://api.openai.com/v1/chat/completions',
      {
        headers: {
          Authorization: `Bearer ${openAISecret}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(reqBody),
        method: 'POST',
        signal: abortController.signal,
      }
    );
  } catch(e) {
    // handle error
  }
}
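Two names in the snippet above, openAISecret and abortController, are assumed to be defined elsewhere in the application: the former holds the credential retrieved from your backend, and the latter lets you cancel an in-flight completion (for instance, when the user stops the captions). A minimal sketch, reusing the hypothetical /api/openai-token endpoint from the prerequisites:

// an AbortController so in-flight completions can be cancelled
const abortController = new AbortController();

// the credential retrieved from your authentication backend
// (the /api/openai-token endpoint is hypothetical)
let openAISecret;
fetch('/api/openai-token')
  .then((res) => res.json())
  .then(({ token }) => {
    openAISecret = token;
  });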

Now let’s take a look at the second part of the function, in which we handle the response from the LLM. As mentioned before, the response is streamed to the client as it is generated by the model, so we need a reader and a decoder to parse the translation.

Then, we use a while loop to go through the streamed text, concatenating it in a utterableText variable until we reach a separator such as a period, a comma, a colon, or an exclamation or question mark. When that happens, we call the speakText function, which utters the text to the user and displays it on the screen. We will look at that function later.

try {
  // previous code…
  
  // we create a variable to concatenate the text from the response
  let utterableText = '';

  // define a reader and decoder for the stream
  const reader = response.body.getReader();
  const decoder = new TextDecoder('utf-8');

  // use a while loop to go through the response
  while (true) {
    const chunk = await reader.read();
    const { done, value } = chunk;
    
    // when streaming stops we cancel the loop
    if (done) {
      break;
    }

    // we parse the translation from the LLM’s response
    const decodedChunk = decoder.decode(value);
    const lines = decodedChunk.split('\n');
    const parsedLines = lines
      .map((l) => l.replace(/^data: /, '').trim())
      .filter((l) => l !== '' && l !== '[DONE]')
      .map((l) => JSON.parse(l));
    for (const line of parsedLines) {
      const textChunk = line?.choices?.[0]?.delta?.content;
      if (textChunk) {
        utterableText += textChunk;

        // when reaching a separator we pass the text to SpeechSynthesis
        //   and clear the temporary variable
        if (textChunk.match(/[.!?:,]$/)) {
          speakText(utterableText);
          utterableText = '';
        }
      }
    }
  }
}
...

Step 3: Display Translation and Text-to-Speech

The final step is to read aloud the translation and display it on the screen. To do so, let’s take a look at the speakText function we called before when parsing the response from the model.

We also create a second function, displayCaptions, which simply creates a new DOM element for displaying the text. We include it below for reference, but you will likely want to adapt it to your own application’s UI logic.

In the speakText function, the first thing we do is create a new SpeechSynthesisUtterance instance. Then, we define its voice and lang properties, and also set some event listeners in case there is something we want to do when the engine starts or stops speaking.

Note that we set lang to a variable called speakSythLange, which in our case contains the language the user selected to translate to, in BCP 47 format. Again, be sure to adapt this to your application’s logic.

Finally, we pass this new object to the browser’s SpeechSynthesis interface to speak the utterance.

// A function that displays the translation in the UI
function displayCaptions(captionText, className, container = document) {
  // gets ui element where message is added from the container
  const [subscriberWidget] = container.getElementsByClassName(className);

  // create a new DOM element for the message
  const captionBox = document.createElement('div');
  captionBox.classList.add('caption-box');
  captionBox.textContent = captionText;

  // add the message to the UI
  subscriberWidget.appendChild(captionBox);
}

function speakText(text) {
  // create SpeechSynthesisUtterance object
  const utterThis = new SpeechSynthesisUtterance(text);

  // configure the utterance object
  utterThis.voice = voices.find((v) => v.name.includes('Samantha'));
  // set the language you're translating to
  utterThis.lang = speakSythLange;

  // set event listeners in case you want to do something when
  //   starting/stopping utterances
  utterThis.onstart = () => {
    // do something when utterance starts here
  };
  utterThis.onend = function () {
    // do something when utterance stops here
  };

  // display and utter translations
  displayCaptions(text, 'ai-assistant-captions');
  window.speechSynthesis.speak(utterThis);
}
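One detail worth calling out: the voices array used in speakText is assumed to be populated elsewhere from the browser’s list of available voices. Some browsers load voices asynchronously, so a common pattern is to refresh the list when the voiceschanged event fires, roughly like this:

// populate the list of available voices; some browsers load them
//   asynchronously, so refresh it when the voiceschanged event fires
let voices = window.speechSynthesis.getVoices();
window.speechSynthesis.onvoiceschanged = () => {
  voices = window.speechSynthesis.getVoices();
};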

Unleashing the power of Generative AI in your real-time communication application

By following these three simple steps, you can empower your real-time communication application with Generative AI and take communication to the next level by transcending language boundaries.

It all starts with transcribing the user’s audio using the Web Speech API’s SpeechRecognition interface. Then you pass those captions as messages, along with a well-crafted prompt, to OpenAI’s GPT-3 LLM to get their translations. Finally, you take the output of the LLM and display it on the screen while SpeechSynthesis reads it aloud.

If you’re looking into unleashing the power of Gen AI in your real-time communication application, we have you covered! Contact us and let’s explore the possibilities. Let’s make it live!
