Unfortunately, it is in Tamil, a language you are not familiar with. You really want to know what this visionary is saying, so you turn to a hypothetical advanced ASR (Automatic Speech Recognition) technology similar to Whisper.

As Dr. Kalam speaks, the ASR app on your device transcribes his speech accurately in real time despite the language barrier, rendering the Tamil into English text.

Thanks to this cutting-edge ASR technology, language is no longer a barrier to knowledge. You become an admirer of Dr. APJ Abdul Kalam's vision, all without needing to understand Tamil.

Welcome to the future of communication, where ASR, and especially Whisper, opens a gateway to understanding and connection in diverse, multilingual settings.


Downloading Whisper locally

Now, let's bring this marvel right into your own space. Whisper, the ASR model developed by OpenAI, can be nestled snugly into your local machine. Though it's made available by OpenAI directly, we'll favor Hugging Face for its seamless transformers library integration, user-friendly setup, and Python compatibility.

First things first—we'll want to welcome the 'transformers' library into our coding family:

pip install transformers

With our new digital toolkit in place, we can invite Whisper into our home. It's as simple as summoning the model with a spell crafted for speech-to-text: we download it using the AutoModelForSpeechSeq2Seq class, which both fetches the checkpoint and initializes the model.
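A minimal sketch, assuming the small checkpoint (any Whisper size works the same way):

from transformers import AutoModelForSpeechSeq2Seq

# Download the checkpoint from the Hugging Face Hub and initialize the model.
# The files are cached locally (by default under ~/.cache/huggingface).
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small")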

This downloads the Whisper model directly onto our device. But what if we want to keep it for the long term? No problem! With the ‘save_pretrained’ charm, we can store it in a directory of our choice for future use.
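For example (the directory path here is just an illustration):

# Write the model's config and weights to a local folder.
model.save_pretrained("./whisper-small")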

Now, Whisper resides within your reach, ready to be awakened whenever you wish:
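A sketch of loading it back from that saved directory instead of the Hub:

# Load the model from the local folder we saved to earlier.
model = AutoModelForSpeechSeq2Seq.from_pretrained("./whisper-small")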

A processor is also required for speech-to-text. It, too, can be called forth from Hugging Face.

We download it using the AutoProcessor class, which both fetches and initializes the processor.
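A minimal sketch, again assuming the small checkpoint:

from transformers import AutoProcessor

# The processor pairs the feature extractor (audio -> log-Mel spectrogram)
# with the tokenizer (token ids -> text).
processor = AutoProcessor.from_pretrained("openai/whisper-small")
processor.save_pretrained("./whisper-small")  # optional: keep it beside the model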


Boosting Whisper's Local Inference Speed

Congratulations on incorporating the Whisper model into your local environment or server! Yet, it's common to encounter somewhat slower inference times with local models relative to cloud-based inference APIs. This isn't unusual, as local resources are often limited compared to optimized cloud services. However, several strategies are at our disposal to enhance the inference speed of Whisper on a local machine.

Precision differences -

Image source: Freepik

32-bit floating-point (FP32):

In this analogy, Ram reads each word thoughtfully and with exact pronunciation, but the careful reading takes more time. Similarly, FP32 is the standard precision in Whisper and offers the highest accuracy. The higher the bit count, the more data there is to process, which can slow down Whisper's local inference.

16-bit floating-point (FP16):

Here, Ram reads faster, occasionally skipping or mispronouncing a word. Similarly, halving the number of bits can lead to faster inference in Whisper because there is less data to churn through. Yet it's a balancing act; we may trade a little accuracy for speed.

8-bit integer (INT8):

Here, Ram rushes through the sentences, words tumbling out in haste. Similarly, INT8 can significantly speed up inference times, but with a potential loss in the accuracy of Whisper's transcriptions.
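Below is a minimal sketch of how these precisions might be requested with transformers. The FP16 flag is standard; the INT8 line uses PyTorch dynamic quantization, which is one possible route rather than the only one:

import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "openai/whisper-small"

# FP32 (default): highest precision, most data to process.
model_fp32 = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

# FP16: load the weights in half precision (pays off mainly on GPUs).
model_fp16 = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16
)

# INT8 on CPU: dynamically quantize the linear layers (an assumption here,
# not the only way to obtain an int8 model).
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)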

When employing CPUs for Whisper, not all precisions yield equal benefits. While FP16 may theoretically promise faster processing, many CPU architectures aren't optimized for 16-bit operations. This can force an internal up-conversion to FP32 for computation and a conversion back to FP16 afterwards, introducing overhead and yielding inference times comparable to, or even slower than, FP32.

In contrast, INT8 often achieves a more noticeable decrease in inference time on CPUs. However, this efficiency comes at the expense of information loss, potentially compromising the reliability of the transcribed output.

The impact of precision on CPU performance can be seen in the benchmark screenshots below; each test was run 10 times to get reliable results.

Screenshot: inference time for each precision on the CPU

Screenshot: output of float16 and float32 (similar)

Screenshot: output of int8

On the other hand, pre-trained INT8 versions of the Whisper small model aren't readily available on the Hugging Face Hub for GPUs. On GPUs, however, the architecture yields a notable difference in inference time between float16 and float32 without compromising accuracy, as the screenshot below shows.

Screenshot: inference time for each precision on the GPU (each test run 10 times to get reliable results)
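For GPU inference, a sketch of loading the model in half precision (assuming a CUDA device is available):

import torch
from transformers import AutoModelForSpeechSeq2Seq

# fp16 on a GPU typically cuts inference time without hurting accuracy.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-small", torch_dtype=torch.float16
).to("cuda")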


Batch processing - 

Batch processing is the second method for enhancing Whisper's inference speed. Instead of passing input data sequentially, we pass it in batches, optimizing resource utilization during pre-processing and post-processing.

It's crucial to note that the Whisper model doesn't inherently support batch processing in the core sense, where a batch of inputs is fused into a single matrix for one forward pass.

Despite this limitation, batching can still yield some reduction in inference time, because pre-processing and post-processing are performed on whole batches.
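As a sketch, the high-level transformers pipeline accepts a list of inputs and a batch_size argument (the file names and batch size below are illustrative):

from transformers import pipeline

# batch_size controls how many audio chunks are pre- and post-processed together.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    batch_size=8,  # tune to your hardware
)

# Placeholder file names; pass your own audio paths.
results = asr(["clip1.wav", "clip2.wav", "clip3.wav"])
for result in results:
    print(result["text"])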

If you're interested in delving deeper into batch processing in machine learning, I recommend my dedicated blog on the subject: Batch Processing in Machine Learning: Navigating Through Data with Efficiency.


Model selection - 

Choosing a Whisper model is akin to selecting the perfect vehicle for a journey: do you prioritize the steady reliability of a vintage car (accuracy) or the sleek rush of a sports car (speed)? This decision is a critical step when running Whisper on your local machine or server. Several factors come into play, such as the intended use, whether you prioritize accuracy or speed, and the capabilities of your hardware.

Hugging Face offers numerous Whisper models as open source, each with different configurations and use cases. There are six models from OpenAI itself: tiny, base, small, medium, large-v2, and large-v3.

Additionally, there's the Distil-Whisper series, a distilled advancement over the OpenAI models. For instance, distil-small is not only six times faster but also 49% smaller than the OpenAI small model, making it a much lighter option. However, it's worth noting that Distil models are currently available exclusively for English speech recognition.
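Swapping models is just a matter of changing the checkpoint id; a sketch with a few real Hub ids:

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Pick the checkpoint that matches your accuracy/speed trade-off:
# "openai/whisper-tiny"      - fastest, least accurate
# "openai/whisper-large-v3"  - most accurate, heaviest
model_id = "distil-whisper/distil-small.en"  # English-only, lighter and faster

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)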


Conclusion - 

Bringing Whisper onto your own computer for working across languages is remarkably useful. The Hugging Face transformers library makes it easy to download and save the Whisper model, and if you want faster local inference, you can experiment with precision settings, batch processing, and the choice of Whisper variant. Whisper is like a bridge that brings different voices together smoothly as we move into the future of communication.

This blog details my learnings from my internship at Softude while working with Mradul Kanugo.

Sources of Article

OpenAI, Hugging Face, India Today, Freepik
