Can we use Common Voice to train a Multi-Speaker TTS system?
Sewade Ogun, Vincent Colotte, Emmanuel Vincent

TL;DR
This paper shows that selecting high-quality samples from crowdsourced datasets such as Common Voice with a non-intrusive MOS estimator makes these datasets usable for training multi-speaker TTS systems, improving output quality and enabling broader language coverage.
Contribution
It proposes automatic quality-based sample selection using WV-MOS, a non-intrusive MOS estimator, and shows the approach is effective for training a multi-speaker GlowTTS model on noisy, crowdsourced data.
Findings
Training on the selected samples improves MOS by 1.26 points over training on all samples.
Training on WV-MOS-selected samples outperforms both training on all samples and training on LibriTTS.
The approach facilitates TTS dataset curation for diverse languages.
Abstract
Training of multi-speaker text-to-speech (TTS) systems relies on curated datasets based on high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are expensive to collect. As an alternative, recent studies have leveraged the availability of large, crowdsourced automatic speech recognition (ASR) datasets. A major problem with such datasets is the presence of noisy and/or distorted samples, which degrade TTS quality. In this paper, we propose to automatically select high-quality training samples using a non-intrusive mean opinion score (MOS) estimator, WV-MOS. We show the viability of this approach for training a multi-speaker GlowTTS model on the Common Voice English dataset. Our approach improves the overall quality of generated utterances by 1.26 MOS point with respect to training on all the samples and by 0.35 MOS point with respect to training on the…
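The selection step described in the abstract can be sketched as a threshold filter over per-utterance quality estimates. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Utterance` record, the `estimate_mos` callable, the toy scores, and the threshold value of 3.5 are all hypothetical. In practice `estimate_mos` would wrap a non-intrusive MOS estimator such as WV-MOS, and the selection criterion could differ from a simple per-utterance cutoff.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Utterance:
    """One crowdsourced recording (hypothetical record layout)."""
    path: str        # location of the audio file
    speaker_id: str  # anonymized speaker label
    text: str        # transcript used as TTS input


def select_by_estimated_mos(
    utterances: Iterable[Utterance],
    estimate_mos: Callable[[str], float],
    threshold: float,
) -> List[Utterance]:
    """Keep utterances whose estimated MOS is at least `threshold`.

    `estimate_mos` is an assumed callable mapping an audio path to a
    predicted MOS; it stands in for a real non-intrusive estimator.
    """
    return [u for u in utterances if estimate_mos(u.path) >= threshold]


# Toy stand-in scores so the sketch runs without any audio or model.
toy_scores = {"a.wav": 4.2, "b.wav": 2.1, "c.wav": 3.9}

corpus = [
    Utterance("a.wav", "spk1", "hello world"),
    Utterance("b.wav", "spk1", "noisy sample"),
    Utterance("c.wav", "spk2", "good morning"),
]

# The threshold is illustrative only; it is not taken from the paper.
selected = select_by_estimated_mos(corpus, toy_scores.__getitem__, threshold=3.5)
print([u.path for u in selected])
```

Passing the estimator in as a callable keeps the filter independent of any particular MOS model, so scores can also be precomputed once and looked up, as the toy dictionary does here.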
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
