WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Linhan Ma; Dake Guo; Kun Song; Yuepeng Jiang; Shuai Wang; Liumeng Xue; Weiming Xu; Huan Zhao; Binbin Zhang; Lei Xie

arXiv:2406.05763 · eess.AS · June 21, 2024 · 1 cites

Open Access · 1 Repo · 5 Models · 1 Dataset

TL;DR

WenetSpeech4TTS is a large, high-quality 12,800-hour Mandarin speech corpus designed for TTS model training and benchmarking, derived from WenetSpeech with improved segmentation and filtering.

Contribution

This work introduces WenetSpeech4TTS, a refined, multi-domain Mandarin TTS dataset with quality-based subsets and benchmark results for TTS system evaluation.

Findings

01

VALL-E and NaturalSpeech 2 trained on WenetSpeech4TTS subsets

02

Benchmark results established for fair TTS system comparison

03

Public availability of the corpus and benchmarks

Abstract

With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of…
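
The abstract mentions subsets of varying sizes categorized by segment quality scores. The following is a minimal sketch of how such threshold-based subsets could be built. The excerpt above does not state the scoring metric, the thresholds, or the subset names, so `Segment`, `build_subsets`, `total_hours`, the names `high`/`medium`/`low`, and the cutoff values are all hypothetical placeholders rather than the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One audio segment with a hypothetical per-segment quality score."""
    segment_id: str
    duration_s: float
    quality: float


def build_subsets(segments, thresholds):
    """Map each subset name to the segments whose score meets its threshold.

    Subsets are nested: a lower threshold yields a superset of a higher one.
    """
    return {
        name: [s for s in segments if s.quality >= cutoff]
        for name, cutoff in thresholds.items()
    }


def total_hours(segments):
    """Total duration of the given segments, in hours."""
    return sum(s.duration_s for s in segments) / 3600.0


# Placeholder cutoffs (assumed, not taken from the excerpt above).
THRESHOLDS = {"high": 4.0, "medium": 3.8, "low": 3.6}

# Toy data standing in for scored corpus segments.
segments = [
    Segment("a", 3600.0, 4.2),
    Segment("b", 1800.0, 3.9),
    Segment("c", 1800.0, 3.7),
    Segment("d", 3600.0, 3.0),  # below every cutoff, so excluded everywhere
]

subsets = build_subsets(segments, THRESHOLDS)
for name, members in subsets.items():
    print(f"{name}: {len(members)} segments, {total_hours(members):.2f} h")
```

Nesting the subsets this way lets a model be trained on the larger, lower-threshold subset and then fine-tuned on the smaller, higher-quality one, which matches the training and fine-tuning use described in the abstract.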

Code & Models

Repositories

dukGuo/valle-audiodec
PyTorch · Official

Models

Datasets

Wenetspeech4TTS/WenetSpeech4TTS
dataset · 1.0k downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and dialogue systems