OpenAI has recently released a new speech recognition model called Whisper. Unlike DALL·E 2 and GPT-3, Whisper is a free and open-source model.

Whisper is an automatic speech recognition model trained on 680,000 hours of multilingual data collected from the web. As per OpenAI, this model is robust to accents, background noise and technical language. In addition, it supports transcription in 99 different languages and translation from those languages into English.

This article explains how to convert speech into text using the Whisper model and Python. It won't cover how the model works or the model architecture; you can learn more about Whisper on OpenAI's GitHub page.

Whisper comes in five model sizes (see the table below, reproduced from OpenAI's GitHub page). Four of them also have English-only versions, denoted by the .en suffix. According to OpenAI, the English-only tiny.en and base.en models perform noticeably better, while the difference becomes less significant for the small.en and medium.en models.

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x
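Since the table only gives approximate memory requirements, it can be handy to pick a model size programmatically. The sketch below is a hypothetical helper, not part of Whisper's API; the figures are copied from the table above.

```python
# Approximate figures from the model table on OpenAI's Whisper GitHub page.
# Keys are ordered from smallest to largest model.
MODELS = {
    "tiny":   {"params_m": 39,   "vram_gb": 1,  "en_variant": True},
    "base":   {"params_m": 74,   "vram_gb": 1,  "en_variant": True},
    "small":  {"params_m": 244,  "vram_gb": 2,  "en_variant": True},
    "medium": {"params_m": 769,  "vram_gb": 5,  "en_variant": True},
    "large":  {"params_m": 1550, "vram_gb": 10, "en_variant": False},
}

def pick_model(vram_gb, english_only=False):
    """Return the largest model name that fits in vram_gb of GPU memory.

    If english_only is True, restrict the search to sizes that have a
    .en variant and append the '.en' suffix to the returned name.
    """
    best = None
    for name, spec in MODELS.items():
        if spec["vram_gb"] <= vram_gb and (not english_only or spec["en_variant"]):
            best = name  # dict preserves order, so the last fit is the largest
    if best is not None and english_only:
        best += ".en"
    return best
```

For example, `pick_model(10)` returns `"large"`, while `pick_model(5, english_only=True)` returns `"medium.en"`.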

For this article, I am converting a YouTube video into audio and passing the audio to a Whisper model to convert it into text.

I used Google Colab with a GPU to execute the code below.
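As a quick sanity check that Colab actually assigned a GPU, you can ask PyTorch (which Whisper installs as a dependency) which device is available:

```python
import torch

# Whisper runs on the GPU when one is available; otherwise it falls back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")
```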

Importing Pytube Library

!pip install --upgrade pytube

Reading a YouTube video and downloading it as an MP4 file to transcribe.

In the first example, I am transcribing the famous dialogue from the movie Taken, as per the YouTube video below.

# Importing the pytube library
import pytube

# Reading the above Taken movie YouTube link
video = 'https://www.youtube.com/watch?v=-LIIf7E-qFI'
data = pytube.YouTube(video)

# Converting and downloading the audio as an 'MP4' file
audio = data.streams.get_audio_only()
audio.download()


Output

The above YouTube video has been downloaded as an 'MP4' file and stored under /content.

Now, the next step is to convert audio into text. We can do this in three lines of code using Whisper.

Importing Whisper library

# Installing the Whisper library
!pip install git+https://github.com/openai/whisper.git -q
import whisper


Loading model

I am using the medium multilingual model here, passing it the audio file downloaded above (I will find You I will Kill You Taken Movie best scene ever liam neeson.mp4), and storing the result as a text object.

model = whisper.load_model("medium")
text = model.transcribe("I will find You I will Kill You Taken Movie best scene ever liam neeson.mp4")

# Printing the transcription
text['text']


Output

Below is the text from the audio. It exactly matches the audio.

I don’t know who you are. I don’t know what you want. If you are looking for ransom, I can tell you I don’t have money. But what I do have are a very particular set of skills. Skills I have acquired over a very long career. Skills that make me a nightmare for people like you. If you let my daughter go now, that will be the end of it. I will not look for you. I will not pursue you. But if you don’t, I will look for you. I will find you. And I will kill you. Good luck.
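Beyond the plain text, the dictionary returned by transcribe() also contains the detected language and a list of timestamped segments. The helper below is a small sketch in pure Python; it assumes a result dict with the segment structure Whisper returns (each segment carrying a start time in seconds and a text field), demonstrated here on a stand-in dict rather than a real model call.

```python
def format_segments(result):
    """Render Whisper's timestamped segments as 'MM:SS text' lines."""
    lines = []
    for seg in result.get("segments", []):
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"{minutes:02d}:{seconds:02d} {seg['text'].strip()}")
    return "\n".join(lines)

# Example with a stand-in dict shaped like Whisper's output:
fake_result = {"segments": [
    {"start": 0.0, "text": " I don't know who you are."},
    {"start": 3.5, "text": " I don't know what you want."},
]}
print(format_segments(fake_result))
# 00:00 I don't know who you are.
# 00:03 I don't know what you want.
```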

How about converting audio in a different language?

As we know, Whisper supports 99 languages; I am trying Tamil, an Indian language, converting the movie clip below into text.

In this example, I used the large model.

# Importing the pytube library
import pytube

# Reading the above Tamil movie clip from the YouTube link
video = 'https://www.youtube.com/watch?v=H1HPYH2uMfQ'
data = pytube.YouTube(video)

# Converting and downloading the audio as an 'MP4' file
audio = data.streams.get_audio_only()
audio.download()


Output

Loading Large Model

# Loading the large model
model = whisper.load_model("large")
text = model.transcribe("Petta mass dialogue with WhatsApp status 30 Seconds.mp4")

# Printing the transcription
text['text']


Output

The model converted the above Tamil audio clip into text. It transcribed the audio well; however, I can see some small variations in the language.

சிறப்பான தரமான சம்பவங்களை இனிமேல் தான் பார்க்கப் போகிறேன். ஏய்.. ஏய்.. ஏய்.. சத்தியமா சொல்கிறேன். அடிச்சி அண்டு வேண்டும் என்று ஓழ்வு விட்டுடுவேன். மானம் போலம் திருப்பி வராது பார்த்துவிடு. ஏய்.. யாருக்காவது பொண்டாட்டி குழந்தைக் குட்டியன் சென்றும் குட்டும் என்று செய்துவிட்டு இருந்தால் அப்டியே ஓடி போய்டு.

I mainly tried the medium and large models. They are robust and transcribe the audio accurately. I also transcribed longer audio of up to 10 minutes using an Azure Synapse notebook with a GPU, which worked very well.

Whisper is fully open source and free, so we can use it directly for speech recognition in our own projects. We can translate other languages into English as well; I will cover that in my next article, with long audio and translating different languages into English.
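Whisper's transcribe() accepts a task option: task="translate" makes the model output English regardless of the source language, while the default task="transcribe" keeps the original language; a language code (e.g. "ta" for Tamil) can also be passed to skip auto-detection. The helper below is a hypothetical convenience wrapper (not part of Whisper's API) that validates and builds those keyword arguments:

```python
def transcribe_options(task="transcribe", language=None):
    """Build the keyword arguments for model.transcribe().

    task="translate" asks Whisper to output English regardless of the
    source language; task="transcribe" keeps the original language.
    language (e.g. "ta" for Tamil) skips Whisper's auto-detection.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    opts = {"task": task}
    if language is not None:
        opts["language"] = language
    return opts

# Usage with a loaded model and a local audio file (not run here):
# result = model.transcribe("Petta mass dialogue with WhatsApp status 30 Seconds.mp4",
#                           **transcribe_options(task="translate"))
# print(result["text"])  # English output instead of Tamil
```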

To learn more about the Whisper model, please visit Whisper's GitHub page.

Sources of Article

Read the original article here: https://towardsdatascience.com/speech-to-text-with-openais-whisper-53d5cea9005e

https://github.com/openai/whisper
