The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage
Daniel Galvez, Greg Diamos, Juan Ciro, Juan Felipe Cerón, Keith Achorn, Anjali Gopi, David Kanter, Maximilian Lam, Mark Mazumder, Vijay Janapa Reddi

TL;DR
The paper introduces 'The People's Speech', a large, diverse, and freely available English speech dataset of 30,000 hours, designed for improving speech recognition models in both academic and commercial contexts.
Contribution
It presents a novel large-scale, openly licensed speech dataset collected from internet sources, along with its data collection methodology and licensing details.
Findings
A model trained on the dataset achieves a 9.98% word error rate (WER) on Librispeech's test-clean set.
Dataset is openly licensed for academic and commercial use.
Discusses legal and ethical considerations of dataset creation.
Abstract
The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected via searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus and plans for continued maintenance of the project under MLCommons's sponsorship.
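For readers unfamiliar with the metric quoted in the abstract, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model's hypothesis, divided by the number of reference words. The following is a minimal illustrative sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example, and real evaluations usually normalize text (casing, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization, which real evaluation pipelines typically do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a hypothesis that gets one word wrong in a six-word reference scores 1/6, or roughly 16.7% WER; the 9.98% figure in the abstract corresponds to about one error per ten reference words.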
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
