Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

Minchan Kim; Myeonghun Jeong; Byoung Jin Choi; Sunghwan Ahn; Joun Yeop Lee; Nam Soo Kim

arXiv:2203.15447·eess.AS·October 7, 2022

TL;DR

This paper introduces a transfer learning framework for low-resource text-to-speech that leverages large unlabeled speech datasets and wav2vec2.0 representations to improve naturalness, intelligibility, and speaker generalization, even with minimal labeled data.

Contribution

The proposed method effectively utilizes unlabeled speech data for pre-training TTS models, enabling high-quality synthesis with very limited labeled datasets and extending to zero-shot multi-speaker TTS.

Findings

01

The single-speaker TTS model outperforms the baselines when fine-tuned on only 10 minutes of labeled data.

02

The zero-shot multi-speaker TTS model can generate the voice of an arbitrary speaker when fine-tuned on only 30 minutes of labeled single-speaker data.

03

Pre-training on unlabeled multi-speaker speech substantially improves TTS performance, especially when labeled speech is scarce.

Abstract

Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech data for pre-training. By leveraging wav2vec2.0 representations, unlabeled speech can greatly improve performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that the single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and the ZS-TTS model fine-tuned on only 30 minutes of a single-speaker dataset can generate the voice of an arbitrary speaker, by pre-training on…
