Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning

Raviraj Joshi; Nikesh Garera

arXiv:2312.01107 · cs.LG · December 5, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a transfer learning and synthetic data approach for rapid speaker adaptation in low-resource TTS systems, enabling high-quality Hindi speech synthesis with minimal target data.

Contribution

It proposes a novel three-step transfer learning method combining high-resource language data and synthetic data for low-resource TTS adaptation.

Findings

01. Effective speaker adaptation with only 3 hours of data

02. Synthetic data improves TTS quality in low-resource settings

03. Transfer learning from a high-resource language accelerates model training

Abstract

Text-to-speech (TTS) systems are being built using end-to-end deep learning approaches. However, these systems require huge amounts of training data. We present our approach to building a production-quality TTS system and performing speaker adaptation in extremely low-resource settings. We propose a transfer learning approach using high-resource language data and synthetically generated data. We transfer the learnings from the out-of-domain, high-resource English language. Further, we make use of an out-of-the-box single-speaker TTS system in the target language to generate in-domain synthetic data. We employ a three-step approach to train a high-quality single-speaker TTS system in Hindi, a low-resource Indian language. We use a Tacotron2-like setup with a spectrogram prediction network and a WaveGlow vocoder. The Tacotron2 acoustic model is trained on English data, followed by synthetic Hindi data from the…
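The staged training order described in the abstract can be pictured as a chain of fine-tuning runs, each warm-started from the previous checkpoint. The sketch below is only an illustration of that idea, not the authors' code: the `Stage` class, the `build_schedule` helper, the dataset labels, and the checkpoint names are all hypothetical. The abstract is cut off before it names the third stage, so the `target_speaker_hindi` label is an assumption inferred from the summary's mention of adaptation with minimal target data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    """One fine-tuning run in a staged transfer-learning schedule (hypothetical)."""
    name: str
    dataset: str
    init_from: Optional[str]  # checkpoint to warm-start from; None means random init


def build_schedule(datasets: List[str]) -> List[Stage]:
    """Chain stages so each one warm-starts from the previous stage's checkpoint."""
    stages: List[Stage] = []
    prev_ckpt: Optional[str] = None
    for i, dataset in enumerate(datasets, start=1):
        name = f"stage{i}_{dataset}"
        stages.append(Stage(name=name, dataset=dataset, init_from=prev_ckpt))
        prev_ckpt = f"{name}.ckpt"  # hypothetical checkpoint file name
    return stages


# Stages 1 and 2 follow the abstract (English, then synthetic Hindi).
# Stage 3 is an ASSUMPTION: the abstract is truncated before naming it.
schedule = build_schedule(["english", "synthetic_hindi", "target_speaker_hindi"])

for stage in schedule:
    print(stage.name, "<-", stage.init_from or "random init")
```

In a real setup each `Stage` would drive an acoustic-model training run; the point here is only the ordering and the checkpoint hand-off between stages.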

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Normalizing Flows · Affine Coupling · Invertible 1x1 Convolution · WaveGlow