Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection

Kentaro Seki; Shinnosuke Takamichi; Takaaki Saeki; Hiroshi Saruwatari

arXiv:2210.14850 · cs.SD · October 27, 2022

TL;DR

This paper proposes a data selection method for text-to-speech synthesis that draws training data from dark data such as YouTube videos. It uses an evaluation-in-the-loop approach that selects data by the predicted quality of the resulting synthetic speech.

Contribution

It proposes a new data selection technique based on predicted synthetic speech quality, enabling effective use of dark data for TTS models.

Findings

01. Outperforms conventional acoustic-quality-based data selection methods

02. Effective utilization of dark data like YouTube videos for TTS training

03. Improved TTS model robustness and speaker variation

Abstract

This paper proposes a method for selecting training data for text-to-speech (TTS) synthesis from dark data. TTS models are typically trained on high-quality speech corpora that are costly and time-consuming to collect, which makes it very challenging to increase speaker variation. In contrast, there is a large amount of data whose availability is unknown (a.k.a. "dark data"), such as YouTube videos. To utilize data other than TTS corpora, previous studies have selected speech data from the corpora on the basis of acoustic quality. However, considering that TTS models robust to data noise have been proposed, we should select data on the basis of its importance as training data to the given TTS model, not the quality of the speech itself. Our method with a loop of training and evaluation selects training data on the basis of the automatically predicted quality of synthetic speech of a…
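The loop of training and evaluation described in the abstract might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `evaluation_in_the_loop_select`, the callables `train_tts` and `predict_synth_quality`, the selection unit (speakers here), and the fixed threshold are all hypothetical, and the toy stubs stand in for a real TTS model and a real automatic quality predictor.

```python
from typing import Callable, List, Sequence


def evaluation_in_the_loop_select(
    candidates: Sequence[str],
    train_tts: Callable[[Sequence[str]], object],
    predict_synth_quality: Callable[[object, str], float],
    threshold: float,
) -> List[str]:
    """Keep candidates whose *synthetic* speech is predicted to be good.

    Hypothetical sketch: the selection criterion is the predicted quality
    of speech synthesized by the trained model, not the acoustic quality
    of the original recordings.
    """
    # Training step: fit a TTS model on all candidate data.
    model = train_tts(candidates)
    # Evaluation step: score the synthetic speech for each candidate unit.
    scores = {c: predict_synth_quality(model, c) for c in candidates}
    # Selection step: keep units at or above the (assumed) threshold.
    return [c for c in candidates if scores[c] >= threshold]


# Toy stand-ins (assumptions for illustration only).
TOY_QUALITY = {"spk_a": 4.1, "spk_b": 2.3, "spk_c": 3.6}


def toy_train(data: Sequence[str]) -> dict:
    # Pretend "model" that only records how much data it saw.
    return {"n_train": len(data)}


def toy_predict(model: object, unit: str) -> float:
    # Pretend quality predictor returning a fixed score per speaker.
    return TOY_QUALITY[unit]


selected = evaluation_in_the_loop_select(
    list(TOY_QUALITY), toy_train, toy_predict, threshold=3.5
)
print(selected)
```

In a real pipeline the selected subset would then be used to retrain the TTS model. How scores are predicted and how the cut-off is chosen are not specified in the text above.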

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing