Using LLMs to Evaluate and Improve Automated Transcription Quality

Transcription, a crucial component of modern contact center operations, is largely handled by Automatic Speech Recognition (ASR) systems. These tools, however, can fall short in accuracy and reliability. Consequently, assessing transcript quality becomes imperative, and doing so has traditionally involved costly manual processes. Enter Large Language Models (LLMs), which offer a promising and efficient way to evaluate and improve the quality of automated transcriptions.

In this post, we explore how LLMs can be used not only to evaluate but also to improve the output of ASR systems, providing a pathway to superior customer service and operational efficiency.

How does Automatic Speech Recognition work?

Automatic Speech Recognition (ASR) is a machine learning (ML) technology that processes audio streams and generates a text representation of them.

This same ability to convert audio to text is also the base for other AI-based capabilities such as sentiment analysis or generating call insights and summaries. Therefore, most enterprise-ready contact center solutions that offer those features also include transcription as part of their offerings. One example is Contact Lens, which brings these features and more to Amazon Connect instances. 

If you’re not using Amazon Connect, or you’re interested in building a custom transcription mechanism for your WebRTC application, you can use Amazon Transcribe or a third-party streaming API to do so.

The overall process is as follows:

  1. You transfer your audio streams to the transcription service.
  2. The service responds with transcripts.
  3. Your application processes the transcripts and either stores them or feeds them into other processes such as translation or sentiment analysis.
Depiction of how contact center solutions get transcriptions from a transcription service.
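As a sketch of this flow using Amazon Transcribe's asynchronous StartTranscriptionJob API, the snippet below assembles the job parameters (the job name and S3 bucket are placeholders, and the actual boto3 call is commented out because it requires AWS credentials):

```python
def build_transcription_job(job_name: str, media_uri: str,
                            language_code: str = "en-US") -> dict:
    """Assemble the parameters for an asynchronous transcription job."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": media_uri.rsplit(".", 1)[-1],  # e.g. "wav" or "mp3"
        "LanguageCode": language_code,
    }

params = build_transcription_job("call-1234", "s3://my-bucket/calls/call-1234.wav")

# Requires AWS credentials and the boto3 package:
# import boto3
# transcribe = boto3.client("transcribe")
# transcribe.start_transcription_job(**params)
```

The service runs the job in the background; your application later fetches the resulting transcription file and feeds it into storage or downstream analysis.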

Real-time and asynchronous ASR

Transcription is usually available in two different approaches: real-time and asynchronous, although which ones are offered will depend on the service you use.

  • A real-time approach consists of establishing a connection to the transcription service, usually over WebSockets, for sending audio chunks and receiving text phrases as they are generated. This approach is suitable for in-call features such as subtitles, captions, and live agent assistance.
  • An asynchronous approach involves sending a call recording to the transcription service, which in turn responds with a transcription file. This approach is useful for compliance and generating post-call analysis and insights.
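In the real-time approach, the client's job boils down to slicing audio into small chunks and sending each one as it is produced. A minimal sketch of the chunking side (the WebSocket client itself is omitted):

```python
def audio_chunks(pcm_bytes: bytes, chunk_size: int = 3200):
    """Split raw audio into fixed-size chunks for streaming.

    3200 bytes = 100 ms of 16 kHz, 16-bit mono PCM audio.
    """
    for start in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[start:start + chunk_size]

# Each chunk would be sent to the transcription service as it is produced:
# for chunk in audio_chunks(audio):
#     websocket.send(chunk)
```

The chunk size is a trade-off: smaller chunks lower latency for subtitles and agent assistance, while larger chunks reduce per-message overhead.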

Factors that Influence ASR Accuracy

Automatic Speech Recognition accuracy can be influenced by various factors. Here are some of the key ones:

  • Language and accent diversity
  • Audio quality
  • Context and domain-specific terminology
  • Speaker diarization (who spoke when)

When you are first choosing an ASR system, it’s recommended to prepare a set of test data that resembles the real-life scenarios inherent to your business, so that you can evaluate transcription quality under realistic conditions.

For example, you can use recordings from previous calls where customers speak a different language, users call from noisy places, or the conversation includes medical or legal terms.

A manual transcription of these calls will be the ground truth or reference that you will use to test the quality of the automated transcription service.

Methods to Evaluate Transcription Quality

Once you have your test data ready, there are various ASR accuracy metrics that you can leverage.

One popular metric is Word Error Rate (WER), which calculates the percentage of incorrect word transcriptions in a given audio sample. WER is calculated by adding the number of insertion, substitution, and deletion errors and dividing by the total number of words in the ground truth transcription.
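The calculation above can be implemented directly with a word-level edit distance; a straightforward version (useful later as a reference point for the LLM-based evaluation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions)
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution plus one deletion over a 6-word reference:
wer("un cigarro tú lo puedes encontrar",
    "un cigarro todo puedes encontrar")  # → 2/6 ≈ 0.33
```

Note that WER can exceed 100% when the hypothesis contains many insertions, since the denominator only counts reference words.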

Another alternative is the NER model, which assigns different penalties to errors based on their severity, providing a more nuanced evaluation.
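As a sketch of the idea, the snippet below computes NER-style accuracy as (N − E − R) / N, where N is the word count and E and R are the summed penalties of edition and recognition errors; the severity weights (0.25 for minor, 0.5 for standard, 1 for serious) follow the values commonly used with this model:

```python
SEVERITY_PENALTY = {"minor": 0.25, "standard": 0.5, "serious": 1.0}

def ner_accuracy(n_words: int, edition_errors: list, recognition_errors: list) -> float:
    """NER accuracy = (N - E - R) / N, where E and R are the summed
    penalties of edition and recognition errors, weighted by severity."""
    e = sum(SEVERITY_PENALTY[s] for s in edition_errors)
    r = sum(SEVERITY_PENALTY[s] for s in recognition_errors)
    return (n_words - e - r) / n_words

# 100 words, two minor edition errors and one serious recognition error:
ner_accuracy(100, ["minor", "minor"], ["serious"])  # → 0.985
```

Unlike WER, which counts every error equally, this weighting distinguishes a harmless slip from an error that changes the meaning of the sentence.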

These methods involve human-verified processes, which have both benefits and drawbacks. Having humans in the loop helps with context; on the other hand, assigning penalties and scores can be subjective, and the work is labor-intensive.

Large Language Model (LLM)-based evaluation is an alternative that combines the benefits of human-like understanding of natural language with the flexibility of AI.

The Growth of Language Models

Language models are a type of machine learning model that understands natural language and generates output in a way that resembles human understanding and expression. They acquire linguistic patterns and correlations through statistical analysis, then utilize this knowledge to produce cohesive text.

Over time, language models have expanded not only in the breadth of data employed for training but also in the scale of parameters they encompass. Think of a parameter as an internal dial that can be adjusted to change the way the model understands the input and generates the output.

As language models advanced, they transitioned into Large Language Models (LLMs), which demonstrate a deeper understanding of language nuances and complexities, occasionally outperforming humans in specific contexts. Some examples of LLMs are Nebula, OpenAI’s GPT-4, and Anthropic’s Claude 3.

This combination of language understanding, contextual awareness, error detection capabilities, scalability, adaptability, and real-time feedback makes LLMs an effective tool for evaluating the transcript quality of ASR systems.

Writing LLM Prompts for Evaluating Transcription Quality

The essential step in harnessing the utility of an LLM for any task lies in crafting a well-constructed prompt. (My colleague Ana Saa wrote an excellent post on Prompt Engineering for an AI Translator Bot.) In this scenario, we need to engineer the right prompt to instruct the model to evaluate transcription quality.

The first recommendation is to adopt a few-shot approach. This entails providing the LLM with examples of expected results. In contrast to a zero-shot approach, where the LLM relies solely on training data to produce results, a recent study demonstrated that few-shot approaches yield enhanced outcomes.

Another best practice involves incorporating chain of thought reasoning. This entails presenting not just examples but also the intermediate steps that lead to accurately assigning metrics for transcription chunks. Research revealed that adopting this approach enhances predictions and facilitates a deeper understanding of the model’s reasoning process. 

In the prompt below, I provide ChatGPT, running GPT-3.5, with a few worked examples calculating WER for a paragraph of Natalia Lafourcade’s “Busca un Problema” lyrics, transcribed using OpenAI Whisper.

Reference: busca busca en tus zapatos ya
Transcript: busca busque en tu zapato o sea
* busque replaced busca
* tu replaced tus
* zapato replaced zapatos
* o replaced ya
* sea was added
Deletions: No deletions
WER: 4 Replacements + 1 Insertions + 0 Deletions = 5/6 = 83%

Reference: un cigarro tú lo puedes encontrar
Transcript: un cigarro todo puedes encontrar
* todo replaced tú
Insertions: No insertions
* lo was deleted
WER: 1 Replacements + 0 Insertions + 1 Deletions = 2/6 = 33%

Reference: busca busca a la persona que hará
Transcript: busca busca a la persona que haga
* haga replaced hará
Insertions: No insertions
Deletions: No deletions
WER: 1 Replacements + 0 Insertions + 0 Deletions = 1/7 = 14%

Reference: que tus sueños solo se hagan realidad
Transcript: que tu sueño solo te haga de arriba
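A prompt like the one above can also be assembled programmatically. The sketch below (the function name and structure are illustrative) joins the worked examples, each with its chain-of-thought reasoning, and appends the unsolved reference/transcript pair for the model to complete:

```python
def build_wer_prompt(examples: list[tuple[str, str, str]],
                     reference: str, transcript: str) -> str:
    """Build a few-shot, chain-of-thought WER prompt: each example
    carries its worked reasoning; the final pair is left unsolved."""
    parts = [
        f"Reference: {ref}\nTranscript: {hyp}\n{reasoning}"
        for ref, hyp, reasoning in examples
    ]
    # The model is expected to continue the pattern for this last pair.
    parts.append(f"Reference: {reference}\nTranscript: {transcript}")
    return "\n\n".join(parts)
```

Keeping the examples in a structured list makes it easy to swap in domain-specific samples (medical terms, noisy calls) without rewriting the prompt by hand.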

Let’s take a look at the output from ChatGPT:

For the fourth set of lyrics, let's first identify the errors between the reference 
and the transcript:

Reference: que tus sueños solo se hagan realidad  
Transcript: que tu sueño solo te haga de arriba  

1. "tu" replaced "tus"
2. "sueño" replaced "sueños"
3. "te" replaced "se"
4. "haga" replaced "hagan"
5. "de arriba" replaced "realidad"

Insertions: No insertions  
Deletions: No deletions  

Count of errors:  
- Replacements: 5  
- Insertions: 0  
- Deletions: 0  

Total errors: 5

Calculation of Word Error Rate (WER):  
- Number of words in the reference: 7  
- Total errors: 5  

WER = Total Errors / Number of Words in Reference = 5/7 ≈ 71.43%

So, the WER for the fourth set is approximately 71.43%.

As expected, furnishing a few examples and a comprehensive chain of thought enabled the GPT-3.5 model to deliver a largely accurate assessment of the transcription. (A strict word-level alignment would count “de arriba” as a substitution plus an insertion, giving 6/7 ≈ 86%, but the model followed the reasoning format of the examples closely.)

Agentic Workflows

Design patterns can also be integrated into Large Language Models (LLMs). Known as agentic workflows, these patterns encompass reflection, tool utilization, planning, and collaboration among multiple agents. They have been effectively employed in diverse applications, yielding impressive outcomes.

A recent analysis revealed that GPT-4 outperforms GPT-3.5 in zero-shot coding tasks, achieving a 67.0% accuracy compared to GPT-3.5’s 48.1%. However, incorporating an iterative agent workflow significantly boosts performance, with GPT-3.5 wrapped in an agent loop achieving an impressive 95.1% accuracy.

In the context of converting audio to text, this translates into the ability to pipe the outcome from the transcription service into one or multiple LLM models. These models perform improvements over the transcript, such as removing filler words, fixing grammatical errors, or adding domain-specific context, making the transcription more accurate.
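A minimal sketch of such a pipeline is shown below, with a simple regex-based filler-word step standing in for what would, in practice, be LLM calls (grammar fixes, domain-specific corrections):

```python
import re

def remove_fillers(text: str) -> str:
    """One concrete improvement step: strip common filler words."""
    cleaned = re.sub(r"\b(um|uh|you know|like)\b[,]?\s*", "", text,
                     flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

def improve_transcript(transcript: str, steps) -> str:
    """Pipe the ASR output through a sequence of improvement steps.
    In an agentic workflow, each step could be a separate LLM call;
    here they are plain functions for illustration."""
    for step in steps:
        transcript = step(transcript)
    return transcript

print(improve_transcript("um, the patient has, uh, hypertension",
                         [remove_fillers]))
# → "the patient has, hypertension"
```

Because each step has the same signature (text in, text out), new agents such as a grammar-correction or terminology-normalization pass can be appended to the list without touching the pipeline itself.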

This could also apply to real-time transcription, provided there is an LLM capable of generating output fast enough for the transcript to pass through the model (or models) before the final text is shown to the user.

Ready to Improve Your ASR’s Transcription Quality?

Integrating Large Language Models with ASR systems presents a significant opportunity for contact centers and other businesses that use these tools to enhance the accuracy and utility of their transcription processes. By taking advantage of their understanding of natural language, and empowered with a few-shot approach that includes chain-of-thought reasoning, businesses can achieve more reliable and insightful analyses of customer interactions, which in turn can enhance service delivery and compliance adherence.

Are you looking to implement transcription capabilities in your contact center solution? Look no further: our team of skilled developers is here to tackle any challenge and deliver a top-notch solution tailored to your needs. Whether you’re a small contact center looking to disrupt the market or an established enterprise seeking to enhance customer engagement, we’re here to help. Contact us today and let’s make it live!
