Facebook AI has launched a highly efficient real-time neural text-to-speech (TTS) system. Facebook claims the system delivers industry-leading compute efficiency and human-level audio quality. Running on CPUs, it produces a second of audio in 500 milliseconds. The system is deployed in Portal, Facebook’s video-calling device, and is available for use across a range of other Facebook applications. The company says it is highly flexible and will help create and scale new voice applications that sound more human and expressive and are more enjoyable to use.

Modern AI-based TTS systems use neural networks to imitate the human voice. To produce humanlike speech, a TTS system must output as many as 24,000 samples or more for each second of audio. This demands massive computation, so such systems often need GPUs or other specialized hardware, such as field-programmable gate arrays (FPGAs) or custom-designed AI chips like Google’s tensor processing units (TPUs), to run, to train, or both. One recent Google AI system, for example, was trained across 32 TPUs in parallel.

Facebook says its TTS system delivers state-of-the-art audio quality while being hosted in real time on regular CPUs, without any specialized hardware. The company says the system attained a 160-times speedup over a baseline, which makes it suitable for computationally constrained devices.
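The figures quoted above (24,000 samples per second of speech, one second of audio in 500 milliseconds, a 160-times speedup) can be tied together with a little arithmetic. The sketch below is only illustrative: the variable names are ours, and the implied baseline assumes the 160-times figure applies to the same 500-millisecond measurement, which the article does not state explicitly.

```python
# Back-of-the-envelope numbers from the figures quoted in the article.
SAMPLE_RATE_HZ = 24_000   # samples needed per second of speech
SYNTHESIS_TIME_S = 0.5    # time reported to produce one second of audio
AUDIO_DURATION_S = 1.0    # length of the audio produced in that time
SPEEDUP = 160             # reported speedup over an unspecified baseline

# Real-time factor: below 1.0 means audio is generated faster than it plays.
real_time_factor = SYNTHESIS_TIME_S / AUDIO_DURATION_S

# Average compute time available per output sample, in microseconds.
per_sample_budget_us = SYNTHESIS_TIME_S / (AUDIO_DURATION_S * SAMPLE_RATE_HZ) * 1e6

# ASSUMPTION: the speedup is measured against the same 500 ms figure.
implied_baseline_s = SYNTHESIS_TIME_S * SPEEDUP

print(f"real-time factor:        {real_time_factor:.2f}")
print(f"time per sample:         {per_sample_budget_us:.2f} microseconds")
print(f"implied baseline (est.): {implied_baseline_s:.0f} s per second of audio")
```

A real-time factor below 1.0 is what allows the audio to be streamed as it is generated rather than computed in advance.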

“All these advancements are part of our broader efforts in making systems capable of nuanced, natural speech that fits the content and the situation. When combined with our cutting-edge research in empathy and conversational AI, this work will play an important role in building truly intelligent, human-level AI assistants for everyone,” said Facebook’s engineers in a blog post.

Facebook’s TTS has four parts: a linguistic front-end that converts input text to a sequence of linguistic features, such as phonemes and sentence type; a prosody model that predicts the rhythm and melody to create the expressive qualities of natural speech; an acoustic model that generates the spectral representation of the speech; and a neural vocoder that generates a 24 kHz speech waveform conditioned on prosody and spectral features.
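To make the data flow between the four parts concrete, here is a toy sketch in Python. It is not Facebook’s implementation: every function is a hypothetical stand-in (letters play the role of phonemes, a sine wave plays the role of the neural vocoder), and only the order of the stages and the 24 kHz output rate come from the description above.

```python
import math
from dataclasses import dataclass
from typing import List

SAMPLE_RATE = 24_000  # Hz, the vocoder output rate given in the article


@dataclass
class Prosody:
    duration_s: float  # how long the unit lasts
    pitch_hz: float    # fundamental frequency for the unit


def linguistic_front_end(text: str) -> List[str]:
    """Toy stand-in: treat each letter as a 'phoneme'."""
    return [ch for ch in text.lower() if ch.isalpha()]


def prosody_model(phonemes: List[str], is_question: bool) -> List[Prosody]:
    """Toy stand-in: fixed duration, pitch rising across a question."""
    n = len(phonemes)
    result = []
    for i in range(n):
        rise = 40.0 * i / max(n - 1, 1) if is_question else 0.0
        result.append(Prosody(duration_s=0.08, pitch_hz=120.0 + rise))
    return result


def acoustic_model(phonemes: List[str]) -> List[List[float]]:
    """Toy stand-in: a one-value 'spectral' frame (energy) per phoneme."""
    vowels = set("aeiou")
    return [[1.0 if p in vowels else 0.3] for p in phonemes]


def vocoder(prosody: List[Prosody], frames: List[List[float]]) -> List[float]:
    """Toy stand-in: one sine burst per unit, scaled by the frame energy."""
    samples: List[float] = []
    for pr, frame in zip(prosody, frames):
        count = round(pr.duration_s * SAMPLE_RATE)
        samples.extend(
            frame[0] * math.sin(2 * math.pi * pr.pitch_hz * t / SAMPLE_RATE)
            for t in range(count)
        )
    return samples


def synthesize(text: str) -> List[float]:
    """Run the four stages in order: text -> features -> waveform."""
    phonemes = linguistic_front_end(text)
    prosody = prosody_model(phonemes, is_question=text.strip().endswith("?"))
    frames = acoustic_model(phonemes)
    return vocoder(prosody, frames)


waveform = synthesize("Hi?")
print(len(waveform) / SAMPLE_RATE, "seconds of audio")
```

The point of the sketch is the shape of the pipeline: the vocoder is the stage that must emit 24,000 values per second of speech, which is why it tends to dominate the compute cost.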

Modern commercial speech synthesis systems like Facebook’s use data sets that often contain 40,000 sentences or more. To gather training data, the company’s engineers used a corpus of open-domain speech recordings and selected lines from large, unstructured data sets. The selected lines were further filtered with a language model on criteria such as readability, phonetic and prosodic diversity, and naturalness.

Facebook plans to add more accents, dialects, and languages to its TTS system, including French, German, Italian, and Spanish. The company says it will also make the system lighter and more efficient so that it can run on smaller devices.

DISCLAIMER

The information provided on this page has been procured through secondary sources. In case you would like to suggest any update, please write to us at support.ai@mail.nasscom.in