Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

TL;DR
This paper introduces a novel data selection method for text-to-speech synthesis that leverages dark data like YouTube videos by using an evaluation-in-the-loop approach to improve model training.
Contribution
It proposes a new data selection technique based on predicted synthetic speech quality, enabling effective use of dark data for TTS models.
Findings
Outperforms conventional acoustic-quality-based data selection methods
Effective utilization of dark data like YouTube videos for TTS training
Improved TTS model robustness and speaker variation
Abstract
This paper proposes a method for selecting training data for text-to-speech (TTS) synthesis from dark data. TTS models are typically trained on high-quality speech corpora that are costly and time-consuming to collect, which makes it very challenging to increase speaker variation. In contrast, there is a large amount of data whose availability is unknown (a.k.a. "dark data"), such as YouTube videos. To utilize data other than TTS corpora, previous studies have selected speech data from the corpora on the basis of acoustic quality. However, considering that TTS models robust to data noise have been proposed, we should select data on the basis of its importance as training data to the given TTS model, not the quality of speech itself. Our method with a loop of training and evaluation selects training data on the basis of the automatically predicted quality of synthetic speech of a…
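The abstract is cut off before it gives the details of the loop, so the following is only a minimal sketch of the general idea: rank candidate data by the predicted quality of the speech a trained model synthesizes from it, rather than by the acoustic quality of the recordings themselves. Every name here (`select_by_synthetic_quality`, `train_tts`, `synthesize`, `predict_quality`, `keep_ratio`) is a hypothetical placeholder, the per-speaker granularity and the top-ratio cutoff are assumptions, and the toy stand-ins at the bottom replace a real TTS model and quality predictor. It should not be read as the authors' implementation.

```python
from typing import Callable, Dict, List


def select_by_synthetic_quality(
    candidates: Dict[str, List[str]],
    train_tts: Callable[[Dict[str, List[str]]], object],
    synthesize: Callable[[object, str], object],
    predict_quality: Callable[[object], float],
    keep_ratio: float = 0.5,
) -> List[str]:
    """Rank speakers by the predicted quality of their synthetic speech.

    All callables are hypothetical stand-ins supplied by the caller:
      train_tts       -- trains a TTS model on {speaker: utterances}
      synthesize      -- produces synthetic speech for one speaker
      predict_quality -- automatic quality score (higher is better)
    """
    # 1) Train on every candidate speaker, including noisy ones.
    model = train_tts(candidates)

    # 2) Evaluate the synthetic speech, not the source recordings.
    scores = {
        spk: predict_quality(synthesize(model, spk)) for spk in candidates
    }

    # 3) Keep the best-scoring fraction of speakers (assumed cutoff rule).
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n_keep]


# Toy stand-ins: the "model" just counts utterances per speaker, and the
# "quality" of synthetic speech is that count. Purely illustrative.
toy_candidates = {
    "spk_a": ["a1", "a2"],
    "spk_b": ["b1"],
    "spk_c": ["c1", "c2", "c3"],
    "spk_d": ["d1"],
}
selected = select_by_synthetic_quality(
    toy_candidates,
    train_tts=lambda data: {spk: len(utts) for spk, utts in data.items()},
    synthesize=lambda model, spk: model[spk],
    predict_quality=lambda synth: float(synth),
    keep_ratio=0.5,
)
print(selected)
```

In a real pipeline the selected subset would be used to retrain the TTS model, and the train-evaluate-select cycle could be repeated; how the paper structures that repetition is not recoverable from the truncated abstract.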
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
