GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan

TL;DR
GigaSpeech is a multi-domain English speech recognition corpus with 10,000 hours of high-quality transcribed audio for supervised training, plus 40,000 hours of total audio for semi-supervised and unsupervised training.
Contribution
This paper introduces GigaSpeech, a large-scale, multi-domain speech corpus built with a new forced alignment and segmentation pipeline, and released as five training subsets ranging from 10 hours to the high-quality 10,000-hour XL subset.
Findings
Provides a new large-scale speech corpus covering diverse domains.
Proposes a forced alignment and segmentation pipeline that creates sentence segments for training and filters out segments with low-quality transcriptions.
Offers baseline systems for multiple speech recognition toolkits.
Abstract
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for…
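The WER-capped filtering mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline, whose details are not given in the text above. It assumes that each candidate segment carries its original transcript and a second, automatically decoded transcript, and that the 4% cap is applied per segment. The function names `word_error_rate` and `filter_segments` and the dictionary keys `transcript` and `decoded` are hypothetical.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Standard Levenshtein dynamic programme over words, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_segments(segments: List[Dict[str, str]], max_wer: float = 0.04) -> List[Dict[str, str]]:
    """Keep segments whose decoded text disagrees with the transcript by at most max_wer.

    Assumption: the cap is applied per segment; the source does not say how it is aggregated.
    """
    return [
        seg for seg in segments
        if word_error_rate(seg["transcript"], seg["decoded"]) <= max_wer
    ]


if __name__ == "__main__":
    candidates = [
        {"transcript": "the cat sat on the mat", "decoded": "the cat sat on the mat"},
        {"transcript": "the cat sat on the mat", "decoded": "the cat sat on a hat"},
    ]
    kept = filter_segments(candidates)
    print(len(kept), "of", len(candidates), "segments kept")
```

With a per-segment cap this strict, short segments survive only when the two transcripts match exactly, since a single word error in a six-word segment already gives a WER of about 17%.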
Code & Models
- 🤗 microsoft/wavlm-base-plus · model · 552k dl · ♡ 36
- 🤗 microsoft/wavlm-large · model · 351k dl · ♡ 102
- 🤗 microsoft/unispeech-sat-base-plus-sd · model · 348 dl
- 🤗 microsoft/unispeech-sat-base-plus-sv · model · 1.5k dl · ♡ 1
- 🤗 microsoft/unispeech-sat-base-plus · model · 592 dl
- 🤗 microsoft/unispeech-sat-large-sd · model · 11 dl · ♡ 2
- 🤗 microsoft/unispeech-sat-large-sv · model · 1.2k dl · ♡ 5
- 🤗 microsoft/unispeech-sat-large · model · 677 dl · ♡ 1
- 🤗 microsoft/wavlm-base-plus-sd · model · 124k dl · ♡ 12
- 🤗 microsoft/wavlm-base-plus-sv · model · 173k dl · ♡ 54
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
