GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

Guoguo Chen; Shuzhou Chai; Guanbo Wang; Jiayu Du; Wei-Qiang Zhang; Chao Weng; Dan Su; Daniel Povey; Jan Trmal; Junbo Zhang; Mingjie Jin; Sanjeev Khudanpur; Shinji Watanabe; Shuaijiang Zhao; Wei Zou; Xiangang Li; Xuchen Yao; Yongqing Wang; Yujun Wang; Zhao You; Zhiyong Yan

arXiv:2106.06909 · cs.SD · May 6, 2025 · 20 cites

TL;DR

GigaSpeech is a multi-domain English speech recognition corpus with 10,000 hours of high-quality transcribed audio for supervised training and 40,000 hours of total audio for semi-supervised and unsupervised training.

Contribution

This paper introduces GigaSpeech, a large-scale, multi-domain speech corpus with a novel segmentation pipeline and multiple training subsets, including a high-quality 10,000-hour dataset for speech recognition.

Findings

01

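Provides a new large-scale English speech corpus drawn from audiobooks, podcasts, and YouTube, covering read and spontaneous speech across topics such as arts, science, and sports.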

02

Proposes a new forced alignment and segmentation pipeline that creates sentence segments for training and filters out segments with low-quality transcription.

03

Offers baseline systems for multiple speech recognition toolkits.

Abstract

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for…
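The filtering idea in the abstract (discarding segments whose transcription looks unreliable, with a 4% word error rate cap for the XL subset) can be illustrated with a minimal sketch. This is not the paper's pipeline. It assumes each segment has a provided transcript and a hypothetical decoded hypothesis from some recognizer, and it assumes the cap is applied per segment, which the abstract does not state. The names `Segment`, `wer`, and `filter_segments` are illustrative only.

```python
from dataclasses import dataclass


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of hypothesis against reference (case-insensitive)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


@dataclass
class Segment:
    # Hypothetical record layout, not the corpus's actual metadata schema.
    segment_id: str
    transcript: str  # transcript shipped with the source audio
    decoded: str     # recognizer output for the same audio


def filter_segments(segments: list[Segment], max_wer: float = 0.04) -> list[Segment]:
    """Keep segments whose transcript/decoded WER is at or below max_wer.

    Assumption: the cap is applied per segment; an aggregate cap over the
    whole subset would need a different selection strategy.
    """
    return [s for s in segments if wer(s.transcript, s.decoded) <= max_wer]


if __name__ == "__main__":
    segments = [
        Segment("seg-a", "the cat sat on the mat", "the cat sat on the mat"),
        Segment("seg-b", "the cat sat on the mat", "the cat sat on a mat"),
    ]
    for seg in filter_segments(segments):
        print(seg.segment_id)
```

With the two example segments, only the exact match survives: a single substitution in a six-word transcript already exceeds a 4% cap, which shows how strict such a threshold is for short segments.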

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing