The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage

Daniel Galvez; Greg Diamos; Juan Ciro; Juan Felipe Cerón; Keith Achorn; Anjali Gopi; David Kanter; Maximilian Lam; Mark Mazumder; Vijay Janapa Reddi

arXiv:2111.09344·cs.LG·November 19, 2021·6 cites

TL;DR

The paper introduces 'The People's Speech', a large, diverse, and freely available English speech dataset of 30,000 hours, designed for improving speech recognition models in both academic and commercial contexts.

Contribution

It presents a novel large-scale, openly licensed speech dataset collected from internet sources, along with its data collection methodology and licensing details.

Findings

01

A model trained on the dataset achieves a 9.98% word error rate (WER) on Librispeech's test-clean set.

02

Dataset is openly licensed for academic and commercial use.

03

Discusses legal and ethical considerations of dataset creation.

Abstract

The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected by searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus and plans for continued maintenance of the project under MLCommons's sponsorship.
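
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only. It is not the paper's evaluation code, the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are tokenized on whitespace, with no
    case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a 9.98% WER corresponds to roughly one word-level error for every ten reference words.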

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis