Facebook’s breakthrough in speech recognition may be a pathway to better performance across the nearly 7,000 languages spoken worldwide.
Neural networks’ performance currently correlates directly with the amount of labelled training data they receive: the more data, the better the training. However, labelled data is far harder to come by than unlabelled data, especially in areas such as speech recognition.
Current speech recognition systems require thousands of hours of labelled, transcribed speech to achieve acceptable performance. That requirement means these systems cannot serve the thousands of regional and minority languages that lack such extensively labelled data. And while artificial intelligence is created in the image of human beings, learning largely from labelled examples is far from how humans acquire language skills: infants learn a language simply by hearing it.
Facebook’s research team claims to have moved closer to that ideal: an ultra-low-resource speech recognition system that needs only minimal amounts of transcribed speech for fine-tuning. A pre-print paper on wav2vec 2.0, published by the team on the pre-print server arXiv.org, demonstrates how wav2vec outperforms the industry’s best semi-supervised speech recognition models using just 10 minutes of labelled data and pre-training on 53,000 hours of unlabelled data.
Facebook’s wav2vec sidesteps the requirement for thousands of hours of labelled training data through self-supervision, deriving its own learning signal from the unlabelled audio itself. The model uses an encoder module that converts raw audio into latent speech representations, which are then passed through a Transformer (the architecture introduced by Google) so that the representations capture information from the whole audio sequence. Drawing on the Transformer’s ability to model relationships across a sequence, wav2vec 2.0 builds contextualized representations on top of the continuous speech representations and captures statistical dependencies over the audio sequence end-to-end.
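To make the two-stage design concrete, here is a minimal PyTorch sketch of that encoder-plus-Transformer pipeline. It is purely illustrative: the layer counts, kernel sizes, and dimensions are placeholders, not the published wav2vec 2.0 configuration, and it omits the quantization and training objective described in the paper.

```python
import torch
import torch.nn as nn

class TinyWav2Vec(nn.Module):
    """Illustrative two-stage model: a convolutional feature encoder that
    turns raw waveform samples into latent speech representations, followed
    by a Transformer that contextualizes them over the whole sequence."""

    def __init__(self, dim=512, n_layers=4, n_heads=8):
        super().__init__()
        # Feature encoder: strided 1-D convolutions downsample raw audio
        # into a sequence of latent vectors.
        self.feature_encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
        )
        # Context network: lets every latent vector attend to the rest of
        # the audio sequence, producing contextualized representations.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.context_network = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, waveform):
        # waveform: (batch, samples) of raw audio
        latents = self.feature_encoder(waveform.unsqueeze(1))  # (B, dim, T)
        latents = latents.transpose(1, 2)                      # (B, T, dim)
        return self.context_network(latents)                   # contextual reps

model = TinyWav2Vec()
print(model(torch.randn(2, 16000)).shape)  # (batch, frames, dim)
```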
For the pre-training, the researchers masked out portions of the audio representations and trained the system to identify the content of those portions correctly. They then fine-tuned the system for speech recognition by adding tokens for characters and boundaries such as word spaces. Pre-training the model for the performance evaluation took 5.2 days on 128 Nvidia V100 graphics cards.
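The sketch below illustrates the masking idea: contiguous spans of latent frames are hidden, and the model must work out what belongs in the gaps. The masking probability, span length, and the zero-vector stand-in for a learned mask embedding are assumptions for illustration, not the paper’s hyperparameters.

```python
import torch

def mask_spans(latents, mask_prob=0.065, span_len=10):
    """Randomly pick starting frames and hide the span of frames that
    follows each one; the model is then trained to identify the true
    content of the hidden frames."""
    batch, frames, dim = latents.shape
    mask_embedding = torch.zeros(dim)          # stand-in for a learned vector
    mask = torch.zeros(batch, frames, dtype=torch.bool)
    starts = torch.rand(batch, frames) < mask_prob
    for b, t in starts.nonzero(as_tuple=False):
        mask[b, t : t + span_len] = True       # mask a contiguous span
    masked = latents.clone()
    masked[mask] = mask_embedding
    return masked, mask  # the model predicts content wherever mask is True
```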
The wav2vec 2.0 model trained with the least supervision saw only 10 minutes of labelled data. It achieved a word error rate, the number of errors divided by the total number of words, of 5.7 when tested on the open-source LibriSpeech database. With ten times more labelled data the error rate dropped to 2.3, and it fell further to 1.9 with even more data. These figures are significantly lower than those of other semi-supervised methods that are more sophisticated in structure and require many more hours of labelled training data.
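For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, insertions, and deletions) between the system’s transcript and the reference, divided by the number of reference words. A toy computation, not the evaluation code used in the paper:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of six reference words -> roughly 16.7% WER.
print(word_error_rate("the cat sat on the mat", "the cat sat on the hat"))
```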
“[This] demonstrates that ultra-low resource speech recognition is possible with self-supervised learning on unlabeled data,” the researchers wrote. “We have shown that speech recognition models can be built with very small amounts of annotated data at very good accuracy. We hope our work will make speech recognition technology more broadly available to many more languages and dialects.”
The researchers plan to make the models and code available as an extension to Facebook’s fairseq modelling toolkit. The original version of wav2vec was used by the social media giant for better keyword spotting and acoustic event detection. The newer version should also help Facebook identify posts that violate its community guidelines, extending this ability to more languages.
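As a sense of how a released checkpoint of this kind is typically used, the sketch below loads a publicly available wav2vec 2.0 model through the Hugging Face transformers wrapper, one common way to run the released models outside fairseq. The checkpoint name, the 16 kHz sampling assumption, and the silent placeholder audio are illustrative details, not specifics from the article.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Publicly released English checkpoint; used here only for illustration.
checkpoint = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# `audio` should be a 1-D array of 16 kHz speech samples.
audio = torch.zeros(16000)  # placeholder: one second of silence
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # per-frame character scores

# Greedy CTC decoding: pick the best character per frame, then collapse.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```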