WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng

TL;DR
WenetSpeech is a comprehensive, large-scale Mandarin speech corpus covering diverse domains, speaker styles, and noisy conditions, designed to advance speech recognition research with extensive labeled and unlabeled data.
Contribution
The paper introduces WenetSpeech, the largest open-source Mandarin speech corpus with over 10,000 hours of multi-domain data, and presents novel data collection, segmentation, and validation methods.
Findings
Provides benchmark results for Kaldi, ESPnet, and WeNet.
Demonstrates the corpus's diversity and quality for robust speech recognition.
Establishes new standards for Mandarin speech datasets.
Abstract
In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about 10000 hours unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data on its corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation -- Dev for cross-validation…
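The abstract describes validating and filtering audio/text candidates but does not spell out the mechanism here. As a rough illustration only, the sketch below scores a candidate transcript against an independent ASR hypothesis using character-level edit distance and sorts it into a bucket. This is a simple stand-in, not the paper's end-to-end label error detection approach; the function names, the bucket names, and the `strong`/`weak` thresholds are all hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def agreement(candidate: str, hypothesis: str) -> float:
    """Score in [0, 1]: 1.0 means the candidate text matches the hypothesis exactly."""
    if not candidate:
        return 0.0
    dist = edit_distance(candidate, hypothesis)
    return max(0.0, 1.0 - dist / len(candidate))


def bucket(score: float, strong: float = 0.95, weak: float = 0.6) -> str:
    """Assign a candidate to a bucket; both thresholds are hypothetical values."""
    if score >= strong:
        return "strong"
    if score >= weak:
        return "weak"
    return "drop"


if __name__ == "__main__":
    # Hypothetical candidate (e.g. caption text) and ASR hypothesis.
    candidate = "今天天气很好"
    hypothesis = "今天天汽很好"
    score = agreement(candidate, hypothesis)
    print(bucket(score))
```

Character-level scoring is used because Mandarin text has no whitespace word boundaries; a real pipeline would likely need text normalization before comparison.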
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Test
