Adapting TTS models For New Speakers using Transfer Learning
Paarth Neekhara, Jason Li, Boris Ginsburg

TL;DR
This paper presents transfer learning guidelines for adapting high-quality single-speaker TTS models to new speakers using only a few minutes of speech data, achieving quality comparable to that of models trained on much larger datasets.
Contribution
The paper introduces transfer learning strategies for speaker adaptation in TTS that enable high-quality voice cloning with minimal data. Because adaptation starts from a high-quality single-speaker model, it avoids the noise problems of large public multi-speaker datasets.
Findings
Fine-tuning a single-speaker TTS model on 30 minutes of data yields high-quality speech.
The fine-tuned model performs comparably to models trained on more than 27 hours of data.
Adaptation is effective for both male and female target speakers.
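The intuition behind these findings can be illustrated with a toy sketch. This is not the paper's TTS model or training recipe: a one-dimensional linear model stands in for the network, and the "source speaker" and "target speaker" data, learning rates, and step counts are all hypothetical values chosen for illustration. The sketch only shows why starting from pretrained weights can reach a low loss on a small dataset with a small training budget, where training from scratch under the same budget does not.

```python
# Toy illustration only: a 1-D linear model (w, b) stands in for a TTS network.
# All data and hyperparameters below are hypothetical.

def mse(params, data):
    """Mean squared error of the linear model y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def gradient_descent(params, data, lr, steps):
    """Full-batch gradient descent on the MSE loss, starting from params."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def run_demo():
    # Large "source speaker" dataset: many samples of y = 2.0x + 1.0.
    source = [(i / 50, 2.0 * (i / 50) + 1.0) for i in range(-50, 51)]
    # Small "target speaker" dataset: a few samples of a nearby function.
    target = [(x, 2.2 * x + 0.8) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

    # Pretrain on the source data with a generous step budget.
    pretrained = gradient_descent((0.0, 0.0), source, lr=0.1, steps=500)
    # Fine-tune on the target data with a small step budget.
    finetuned = gradient_descent(pretrained, target, lr=0.05, steps=20)
    # Train from scratch on the target data with the same small budget.
    scratch = gradient_descent((0.0, 0.0), target, lr=0.05, steps=20)

    return mse(finetuned, target), mse(scratch, target)

if __name__ == "__main__":
    finetuned_loss, scratch_loss = run_demo()
    print("fine-tuned loss:", finetuned_loss)
    print("from-scratch loss:", scratch_loss)
```

In this toy setting the fine-tuned model ends with a much lower target loss than the from-scratch model, because the pretrained weights already sit close to the target solution. Real TTS adaptation involves far larger models and perceptual evaluation, so the sketch should be read as an analogy rather than a reproduction of the paper's experiments.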
Abstract
Training neural text-to-speech (TTS) models for a new speaker typically requires several hours of high quality speech data. Prior works on voice cloning attempt to address this challenge by adapting pre-trained multi-speaker TTS models for a new voice, using a few minutes of speech data of the new speaker. However, publicly available large multi-speaker datasets are often noisy, thereby resulting in TTS models that are not suitable for use in products. We address this challenge by proposing transfer-learning guidelines for adapting high quality single-speaker TTS models for a new speaker, using only a few minutes of speech data. We conduct an extensive study using different amounts of data for a new speaker and evaluate the synthesized speech in terms of naturalness and voice/style similarity to the target speaker. We find that fine-tuning a single-speaker TTS model on just 30 minutes…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing
