Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever

TL;DR
This paper demonstrates that weakly supervised training on 680,000 hours of multilingual and multitask audio data yields speech recognition models whose zero-shot performance is often competitive with prior fully supervised results, and whose accuracy and robustness approach those of humans, without any fine-tuning.
Contribution
It introduces a large-scale weak supervision approach for speech recognition that achieves strong zero-shot results and robustness, and is often competitive with prior fully supervised models on standard benchmarks.
Findings
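Models trained on 680,000 hours of multilingual and multitask supervision generalize well to standard benchmarks.
In a zero-shot transfer setting, with no fine-tuning, the models are often competitive with prior fully supervised results.
Compared with humans, the models approach human accuracy and robustness.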
Abstract
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
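Accuracy in speech recognition is conventionally reported as word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that metric, not the paper's evaluation code. The function name `word_error_rate` is hypothetical, and lowercasing plus whitespace splitting is an assumed, simplified stand-in for proper text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: lowercasing and whitespace tokenization stand in for
    real text normalization, which is considerably more involved.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

A lower value is better, 0.0 means the hypothesis matches the reference word for word, and the value can exceed 1.0 when the hypothesis contains many inserted words.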
Peer Reviews
No public reviews on file for this paper yet.
Code & Models
- 🤗 openai/whisper-large-v3 · model · 4.7M dl · ♡ 5541
- 🤗 openai/whisper-large-v3-turbo · model · 5.0M dl · ♡ 2889
- 🤗 openai/whisper-tiny · model · 736k dl · ♡ 421
- 🤗 openai/whisper-small · model · 1.8M dl · ♡ 546
- 🤗 xkeyC/whisper-large-v3-turbo-gguf · model · 11k dl · ♡ 30
- 🤗 openai/whisper-medium · model · 556k dl · ♡ 283
- 🤗 openai/whisper-large · model · 39k dl · ♡ 541
- 🤗 openai/whisper-large-v2 · model · 72k dl · ♡ 1786
- 🤗 WhisperSpeech/WhisperSpeech · model · ♡ 250
- 🤗 NeuralNovel/whisper-small-hi · model · 4 dl · ♡ 4
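As a rough illustration of zero-shot use, one of the checkpoints listed above can be loaded through the Hugging Face `transformers` automatic-speech-recognition pipeline. This is a sketch under stated assumptions rather than the authors' released inference code: it assumes `transformers`, a backend such as PyTorch, and `ffmpeg` are installed, that the audio clip is short (roughly 30 seconds or less), and the names `CHECKPOINTS` and `transcribe` are hypothetical helpers introduced here.

```python
# Checkpoint names copied from the list above (a subset, for illustration).
CHECKPOINTS = [
    "openai/whisper-tiny",
    "openai/whisper-small",
    "openai/whisper-medium",
    "openai/whisper-large-v3",
]


def transcribe(audio_path: str, model_id: str = "openai/whisper-tiny") -> str:
    """Transcribe a short audio file with a pretrained checkpoint, no fine-tuning.

    Assumption: the clip is short enough to fit in a single model window;
    longer recordings need chunking, which this sketch does not handle.
    """
    if model_id not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {model_id}")

    # Imported lazily so the module loads even where transformers is absent.
    from transformers import pipeline

    # Downloads the checkpoint from the Hugging Face Hub on first use.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path)  # dict with a "text" key
    return result["text"].strip()
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a placeholder path, would return the transcript as a string. The smallest checkpoint is the default here only to keep the download small; the larger ones can be selected by passing their ID as `model_id`.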
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
