Speech recognition is the technology behind voice assistants (such as Apple’s Siri and Amazon’s Alexa), search engines, and smart home devices; it also helps businesses streamline their services and build tools for people with hearing and speech impairments. The global market for speech recognition is expected to grow at a CAGR of 17.2 per cent to reach $26.8 billion by 2025, according to Research and Markets.

To turn this projection into reality and broaden the technology’s applications, continuous effort to improve it is indispensable. To that end, the first research the Meta AI team announced in 2022, via its official Twitter account, is a new framework, Audio-Visual HuBERT (AV-HuBERT), intended to help build more versatile and robust AI speech recognition tools.

But why a new framework altogether?

“It is the first system to jointly model speech and lip movements from unlabeled data — the raw video that has not already been transcribed,” says the Meta AI blog post.

Most state-of-the-art speech recognition technology performs below par in everyday use cases. Imagine a situation where multiple people are speaking simultaneously, or the background noise of a pet barking creeps in: even the most sophisticated noise-suppression techniques fail to cope. Here, we humans outsmart these systems because we not only listen but also observe with our eyes. We might notice someone's mouth moving and intuitively know whose voice we are hearing and where it is coming from. That is why Meta AI is developing new conversational AI systems that, like us, can recognise the subtleties of what they see and hear in a conversation.
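The self-supervised idea behind this can be caricatured in a few lines of code: the model sees parallel audio and visual (lip) streams, some audio frames are masked, and the visual stream supplies the context needed to recover the masked audio targets. The sketch below is a deliberately tiny, count-based toy under those assumptions — it is not Meta's architecture, and every function name in it is hypothetical.

```python
import random

def mask_frames(frames, mask_ratio=0.4, seed=0):
    """Randomly replace a fraction of audio frames with None (a stand-in
    for a learned [MASK] embedding); return the masked list and indices."""
    rng = random.Random(seed)
    masked, mask_idx = [], set()
    for i, frame in enumerate(frames):
        if rng.random() < mask_ratio:
            mask_idx.add(i)
            masked.append(None)
        else:
            masked.append(frame)
    return masked, mask_idx

def predict_masked(visual, targets, mask_idx):
    """From the UNMASKED frames only, count which discrete audio cluster
    (a HuBERT-style pseudo-label) each visual feature co-occurs with, then
    predict the masked frames' clusters by majority vote on the visuals."""
    table = {}
    for i, v in enumerate(visual):
        if i not in mask_idx:
            table.setdefault(v, {})
            table[v][targets[i]] = table[v].get(targets[i], 0) + 1
    return {i: max(table[visual[i]], key=table[visual[i]].get)
            for i in mask_idx if visual[i] in table}

# Toy parallel streams: a lip shape per frame, and the discrete cluster id
# of the corresponding audio frame (here perfectly correlated for clarity).
visual  = ["open", "closed"] * 5
targets = [1, 0] * 5
audio   = [1.0, 0.0] * 5

masked_audio, mask_idx = mask_frames(audio)
preds = predict_masked(visual, targets, mask_idx)
```

Even in this caricature, the visual stream alone recovers the masked audio targets, which is the intuition behind training on untranscribed video.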

The newly introduced AV-HuBERT is 75 per cent more accurate than the best existing audio-visual speech recognition systems (which use both the sound and the image of the speaker) trained on the same number of transcriptions. The team trained the model on video recordings from the publicly available VoxCeleb and LRS3 data sets. Moreover, because the pretrained model has already learned the structure of speech and its association with lip movements, only a small amount of labelled data is needed to adapt it to a specific task or a different language.

Another major limitation of current systems is the scarcity of large labelled data sets for most of the world’s languages. AV-HuBERT helps overcome this challenge: it requires just one-tenth of the labelled data needed by the best existing audio-visual speech recognition systems.
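A hedged sketch of why pretraining slashes the labelled-data requirement: if a frozen pretrained encoder already produces well-separated features, a classifier fitted on a handful of labels can generalise. The code below is a toy nearest-centroid illustration of that principle, not Meta's fine-tuning recipe; the encoder and its feature space are invented for the example.

```python
def pretrained_encoder(sample):
    """Hypothetical stand-in for a frozen pretrained encoder: maps a raw
    (audio_level, lip_shape) pair to a 2-D feature vector in which the
    classes are already well separated."""
    audio_level, lip_shape = sample
    return (audio_level, 1.0 if lip_shape == "open" else 0.0)

def fine_tune(labelled_samples, labels):
    """'Fine-tune' on very few labels: fit one feature centroid per class
    on top of the frozen encoder's output."""
    sums, counts = {}, {}
    for x, y in zip(labelled_samples, labels):
        fx, fy = pretrained_encoder(x)
        s = sums.setdefault(y, [0.0, 0.0])
        s[0] += fx
        s[1] += fy
        counts[y] = counts.get(y, 0) + 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}

def classify(centroids, sample):
    """Assign a sample to the class with the nearest centroid."""
    fx, fy = pretrained_encoder(sample)
    return min(centroids,
               key=lambda y: (fx - centroids[y][0]) ** 2
                           + (fy - centroids[y][1]) ** 2)

# Two labelled examples suffice because the pretrained features already
# separate the classes; unseen variations still classify correctly.
centroids = fine_tune([(0.9, "open"), (0.1, "closed")], ["ah", "mm"])
```

The same logic is why a model pretrained on large unlabelled video can be adapted to a new language with a fraction of the transcriptions a from-scratch system would need.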

So, what’s next?

The team has open-sourced its code and made the pretrained AV-HuBERT models available to researchers so that they can build on the work and accelerate progress in automatic speech recognition (ASR). Some of the possibilities include:

  • Overcoming the background noise problem - By combining audio and visual data, AV-HuBERT could one day allow virtual assistants on augmented reality (AR) glasses and smartphones to understand what we're saying regardless of the environment, whether we're on a noisy dance floor, at a concert, or simply speaking with roaring ocean waves behind us.
  • New possibilities with less supervised data - Many widely spoken languages, such as English, Spanish, and Mandarin, already have large-scale labelled data sets. However, Meta's model may help develop conversational AI for the hundreds of millions of people throughout the world who speak languages that lack such resources.

As the model learns from both audio and mouth/lip movements, it may pave the way for more inclusive speech recognition models for people with speech impairments. It could also help detect deepfakes and manipulated content. The technology is here to stay: speech-to-text software is already used by many enterprises to improve business operations and streamline the customer experience. Companies can transcribe calls and meetings, and even translate them, using speech recognition and natural language processing. Apple, Facebook, Microsoft, Google, and Amazon are just a few of the tech behemoths that continue to deploy AI-powered speech recognition applications to create excellent user experiences.

On a related note, the demand for talented AI engineers, developers, machine learning engineers, and data scientists is likely to reach an all-time high, as speech recognition and AI touch both professional and personal life, at workplaces and homes alike.

Sources of Article

Source: Meta AI


DISCLAIMER

The information provided on this page has been procured through secondary sources. In case you would like to suggest any update, please write to us at support.ai@mail.nasscom.in