
Digital assistants like Siri and Alexa produce almost human-like speech thanks to their advanced text-to-speech (TTS) engines, which convert natural language text into spoken audio.

However, building a TTS model of that calibre is complex and often time-consuming. The engines are usually trained in multiple separate stages, with the signal passing through a multi-stage pipeline: text normalisation, aligned linguistic featurisation, raw audio waveform synthesis, and so on.

This complexity also makes the models expensive to train. UK-based artificial intelligence (AI) company DeepMind has taken on the problem with an innovative solution.

Their latest offering, EATS (End-to-End Adversarial Text-to-Speech), is a generative model that produces high-fidelity audio end-to-end via adversarial feedback and prediction losses. According to the paper released by the Google-owned company's team on arXiv.org this Tuesday, the results so far show that EATS performs on par with state-of-the-art (SOTA) models that depend on intensive multi-stage training and added supervision.

The model is trained to map an input sequence of characters or phonemes to a raw audio waveform at 24 kHz. However, the input text and the output speech are neither the same length nor aligned with each other. EATS deploys two submodules to deal with this problem: an aligner predicts the duration of each input token and produces a time-aligned, low-frequency audio representation, and a decoder then upsamples that representation to the full 24 kHz audio frequency.
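To make the alignment idea concrete, here is a minimal, illustrative Python sketch. This is not DeepMind's code: the function name, the 200 Hz intermediate frame rate, and the hard token repetition are simplifying assumptions standing in for the paper's smooth, differentiable interpolation.

```python
import numpy as np

# Illustrative sketch of the aligner idea (not DeepMind's implementation):
# each input token gets a predicted duration; tokens are then "stretched"
# into a time-aligned, low-frequency feature sequence that a decoder
# can upsample to the 24 kHz waveform.

def align(token_features, durations, frame_rate=200):
    """Repeat each token's feature vector for its predicted number of
    low-frequency frames (hard repetition here; EATS instead uses a
    smooth, differentiable interpolation so the whole model stays
    trainable end-to-end)."""
    frames = [np.tile(f, (int(d * frame_rate), 1))
              for f, d in zip(token_features, durations)]
    return np.concatenate(frames, axis=0)  # (num_frames, feat_dim)

# Hypothetical example: 3 tokens with 8-dim features, durations in seconds.
tokens = np.random.randn(3, 8)
durations = np.array([0.10, 0.05, 0.20])  # predicted by an aligner network
aligned = align(tokens, durations)
print(aligned.shape)  # (70, 8): 0.35 s of 200 Hz frames

# A decoder would then upsample these frames by 120x to reach 24 kHz audio.
```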

The EATS model has several achievements worth mentioning. The generator architecture is fully differentiable and is trained end-to-end. Its feed-forward convolutional neural network makes it well suited to applications that need fast inference. The model's adversarial approach also lets the generator learn from comparatively weak supervision, bringing down annotation costs significantly. And because it avoids autoregressive sampling and teacher forcing, it sidesteps issues such as exposure bias and reduced parallelism at inference time, keeping EATS efficient in both training and inference.
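To illustrate the adversarial feedback mentioned above, here is a minimal GAN-style training step in PyTorch. It is purely a sketch: the tiny networks, tensor shapes, and hinge loss are stand-ins for EATS's much larger generator and its ensemble of discriminators that judge random windows of the waveform.

```python
import torch
import torch.nn as nn

# Toy "generator" mapping aligned text features to an audio window,
# and a "discriminator" scoring whether a window sounds real.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 240))
D = nn.Sequential(nn.Linear(240, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

text_feats = torch.randn(8, 16)   # stand-in for aligned text features
real_audio = torch.randn(8, 240)  # stand-in for 10 ms of 24 kHz audio

# Discriminator step: learn to tell real windows from generated ones.
fake_audio = G(text_feats).detach()
loss_d = (torch.relu(1 - D(real_audio)).mean()
          + torch.relu(1 + D(fake_audio)).mean())
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: adversarial feedback pushes outputs toward "real".
loss_g = -D(G(text_feats)).mean()
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The key point this sketch shows is that the supervision signal is a learned judgement of realism rather than a frame-by-frame target, which is why the approach can get by with weaker annotations.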

The DeepMind researchers evaluated EATS using the Mean Opinion Score (MOS) to measure speech quality. All the models were trained on speech datasets recorded by almost 70 professional voice performers speaking North American English.
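For context, MOS is simply the arithmetic mean of listener ratings on a 1-to-5 scale; the ratings below are hypothetical:

```python
# MOS: mean of listener ratings from 1 (bad) to 5 (excellent).
ratings = [4, 5, 4, 3, 5, 4]   # hypothetical listener scores
mos = sum(ratings) / len(ratings)
print(round(mos, 3))           # 4.167
```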

In these tests, EATS achieved a score of 4.083 on the 5-point scale, performing almost as well as state-of-the-art models such as GAN-TTS and WaveNet. Its advantage is that it requires less supervision and no multi-stage training compared with them.

