Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford; Jong Wook Kim; Tao Xu; Greg Brockman; Christine McLeavey; Ilya Sutskever

arXiv:2212.04356·eess.AS·December 9, 2022·1.2k cites

TL;DR

Large-scale weakly supervised training on 680,000 hours of multilingual, multitask audio yields speech recognition models that are often competitive with prior fully supervised results in a zero-shot setting, without fine-tuning, and that approach human accuracy and robustness.

Contribution

The paper introduces a large-scale weak supervision approach to speech recognition whose zero-shot results are robust and often competitive with prior fully supervised models on standard benchmarks.

Findings

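01 Models trained on 680,000 hours of multilingual and multitask supervision generalize well to standard benchmarks.

02 In zero-shot transfer, the models are often competitive with prior fully supervised results, without any fine-tuning.

03 Compared with humans, the models approach human accuracy and robustness.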

Abstract

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

Code & Models

Videos

Robust Speech Recognition via Large-Scale Weak Supervision · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing