OpenAI researchers trained and released Whisper, a neural network that approaches human levels of robustness and accuracy in English speech recognition.
What is Whisper?
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual, multitask supervised data collected from the web. According to the researchers, training on such a large and varied dataset makes the system more robust to accents, background noise, and technical language. It also enables transcription in multiple languages, as well as translation from those languages into English. OpenAI has released the models and inference code so that others can build useful applications and pursue further research on robust speech processing.
Architecture
The Whisper architecture is a simple end-to-end approach implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, each chunk is converted into a log-Mel spectrogram, and the result is passed to the encoder.
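The chunking step can be illustrated with a minimal sketch. It covers only the split into fixed 30-second windows, not the spectrogram computation, and it assumes a 16 kHz mono waveform and zero-padding of the final chunk; neither detail is stated in this article, and the function name is hypothetical.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000   # Hz; an assumption, not stated in this article
CHUNK_SECONDS = 30     # chunk length given in the article
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(
    samples: Sequence[float], chunk_samples: int = CHUNK_SAMPLES
) -> List[List[float]]:
    """Split a mono waveform into fixed-length chunks.

    The last chunk is zero-padded so that every chunk has the same length
    (padding is an assumption made for this sketch).
    """
    chunks: List[List[float]] = []
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        # Pad the final, shorter chunk with silence.
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    # 70 seconds of placeholder audio yields three 30-second chunks.
    audio = [0.1] * (70 * SAMPLE_RATE)
    chunks = split_into_chunks(audio)
    print(len(chunks), len(chunks[0]))
```

Each fixed-length chunk would then be turned into a log-Mel spectrogram before reaching the encoder.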
A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
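To make the role of the special tokens concrete, here is a hedged sketch of how a task prefix for the decoder might be assembled. The token spellings follow the convention used in the open-source Whisper release, but they are not given in this article, and the helper function is hypothetical rather than part of any library.

```python
from typing import List

VALID_TASKS = ("transcribe", "translate")


def build_decoder_prompt(
    language: str, task: str = "transcribe", timestamps: bool = False
) -> List[str]:
    """Build the sequence of special tokens that tells the decoder what to do.

    language: short language code such as "en" or "fr"
    task: "transcribe" (same-language text) or "translate" (to English)
    timestamps: whether the decoder should also predict timestamps
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Signal that no timestamp tokens should be predicted.
        prompt.append("<|notimestamps|>")
    return prompt


if __name__ == "__main__":
    # French audio, translated into English, without timestamps.
    print(build_decoder_prompt("fr", task="translate"))
```

Because the task is selected by tokens rather than by separate model heads, a single set of weights can serve every task listed above.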
Existing approaches
Other approaches commonly use smaller, more closely paired audio-text training datasets or broad but unsupervised audio pretraining. Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not outperform models that specialize in LibriSpeech, a notoriously competitive speech recognition benchmark. Yet when the researchers measured Whisper's zero-shot performance across many different datasets, they found it to be much more robust, making 50% fewer errors than those models.
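Error counts in speech recognition are usually reported as word error rate (WER). The article does not name the metric, so treating "50% fewer errors" as a relative WER reduction is an assumption here. The sketch below computes WER with a word-level edit distance and then the relative reduction between two systems; the example sentences are made up for illustration and are not results from the research.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's errors that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    baseline = word_error_rate(reference, "the cat sit on mat")      # 2 errors
    improved = word_error_rate(reference, "the cat sat on the hat")  # 1 error
    print(relative_error_reduction(baseline, improved))
```

In this toy case the second transcript has half as many errors as the first, which is what a 50% relative reduction means.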
Conclusion
Approximately one-third of Whisper's audio dataset is non-English, and the model is alternately given the task of transcribing in the original language or translating into English. According to the researchers, this approach is particularly effective for learning speech-to-text translation, and it outperforms the supervised SOTA on CoVoST2 to-English translation in a zero-shot setting.
Furthermore, the OpenAI researchers hope that Whisper's high accuracy and ease of use will enable developers to incorporate voice interfaces into a broader range of applications.