Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning
Raviraj Joshi, Nikesh Garera

TL;DR
This paper introduces a transfer learning and synthetic data approach for rapid speaker adaptation in low-resource TTS systems, enabling high-quality Hindi speech synthesis with minimal target data.
Contribution
It proposes a novel three-step transfer learning method combining high-resource language data and synthetic data for low-resource TTS adaptation.
Findings
Effective speaker adaptation with only 3 hours of data
Synthetic data improves TTS quality in low-resource settings
Transfer learning from high-resource language accelerates model training
Abstract
Text-to-speech (TTS) systems are being built using end-to-end deep learning approaches. However, these systems require huge amounts of training data. We present our approach to build production-quality TTS and perform speaker adaptation in extremely low-resource settings. We propose a transfer learning approach using high-resource language data and synthetically generated data. We transfer the learnings from the out-of-domain high-resource English language. Further, we make use of out-of-the-box single-speaker TTS in the target language to generate in-domain synthetic data. We employ a three-step approach to train a high-quality single-speaker TTS system in a low-resource Indian language, Hindi. We use a Tacotron2-like setup with a spectrogram prediction network and a WaveGlow vocoder. The Tacotron2 acoustic model is trained on English data, followed by synthetic Hindi data from the…
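The staged training described in the abstract can be pictured as a chain of warm starts, where each stage initializes from the checkpoint left by the previous one. The sketch below is only an illustration of that chaining idea, not the authors' training code. The stage names, dataset identifiers, and checkpoint naming are hypothetical, and because the abstract is cut off before it names the final step, the third stage (fine-tuning on target-speaker data) is an assumption based on the TL;DR and Findings.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Stage:
    name: str                 # hypothetical label for the training stage
    dataset: str              # hypothetical dataset identifier
    init_from: Optional[str]  # checkpoint to warm-start from; None means random init

    @property
    def checkpoint(self) -> str:
        # Hypothetical naming scheme for the checkpoint this stage produces.
        return f"{self.name}.ckpt"


def build_schedule(stages: List[Tuple[str, str]]) -> List[Stage]:
    """Chain stages so that each one warm-starts from the previous checkpoint."""
    schedule: List[Stage] = []
    previous: Optional[str] = None
    for name, dataset in stages:
        stage = Stage(name=name, dataset=dataset, init_from=previous)
        schedule.append(stage)
        previous = stage.checkpoint
    return schedule


schedule = build_schedule([
    ("english_pretrain", "english_corpus"),          # out-of-domain, high-resource
    ("synthetic_hindi", "synthetic_hindi_corpus"),   # in-domain, generated by an existing TTS
    ("target_speaker", "target_speaker_corpus"),     # assumption: step not named in the truncated abstract
])

for stage in schedule:
    print(stage.name, "<-", stage.init_from)
```

In a real setup, each stage would load the named checkpoint into the acoustic model before continuing training on its own dataset; only the ordering and warm-start dependency are modeled here.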
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Normalizing Flows · Affine Coupling · Invertible 1x1 Convolution · WaveGlow
