Scaling ASR Improves Zero and Few Shot Learning
Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed

TL;DR
This paper demonstrates that scaling up automatic speech recognition models to billions of parameters, combined with data selection and optimization techniques, significantly enhances zero and few-shot learning across diverse domains and styles.
Contribution
The authors introduce data selection and optimization methods to efficiently scale ASR models up to 10 billion parameters, achieving state-of-the-art zero and few-shot performance.
Findings
The models exceed previous results on multiple in-house and public benchmarks in zero-shot and few-shot settings.
Speech recognition improves significantly for speakers with disorders due to brain damage.
The models achieve high performance with substantially less in-domain data.
Abstract
With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage various optimizations such as sparse transducer loss and model sharding. By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains. Furthermore, our models learn powerful speech representations with zero and few-shot capabilities on novel domains and styles of speech, exceeding previous results across multiple in-house and public benchmarks. For speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60%…
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research
