WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

Binbin Zhang; Hang Lv; Pengcheng Guo; Qijie Shao; Chao Yang; Lei Xie; Xin Xu; Hui Bu; Xiaoyu Chen; Chenchen Zeng; Di Wu; Zhendong Peng

arXiv:2110.03370·cs.SD·February 24, 2022·32 cites

TL;DR

WenetSpeech is a comprehensive, large-scale Mandarin speech corpus covering diverse domains, speaker styles, and noisy conditions, designed to advance speech recognition research with extensive labeled and unlabeled data.

Contribution

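The paper introduces WenetSpeech, an open-source multi-domain Mandarin speech corpus with over 10,000 hours of high-quality labeled data, and presents the methods used to build it: OCR-based audio/text segmentation for the YouTube data, ASR-based transcription for the Podcast data, and an end-to-end label error detection approach for validating and filtering the candidates.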

Findings

01 Provides benchmark results for Kaldi, ESPnet, and WeNet.

02 Demonstrates the corpus's diversity and quality for robust speech recognition.

03 Establishes new standards for Mandarin speech datasets.

Abstract

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, covering a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data from the corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. We then propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labeled high-quality test sets along with WenetSpeech for evaluation -- Dev for cross-validation…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Test