Adapting TTS models For New Speakers using Transfer Learning

Paarth Neekhara; Jason Li; Boris Ginsburg

arXiv:2110.05798 · cs.SD · April 7, 2022

Open Access

TL;DR

This paper presents transfer-learning guidelines for adapting a high-quality single-speaker TTS model to a new speaker using only a few minutes of speech data, reaching quality comparable to that of models trained on much larger datasets.

Contribution

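The paper proposes transfer-learning strategies for speaker adaptation in TTS that enable high-quality voice cloning from a few minutes of data. Starting from a clean single-speaker model, rather than a multi-speaker model trained on noisy public datasets, sidesteps the noise problem those datasets introduce.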

Findings

01. Fine-tuning a single-speaker TTS model on 30 minutes of new-speaker data yields high-quality speech.

02. The fine-tuned model performs comparably to models trained on 27+ hours of data.

03. Adaptation is effective for both male and female target speakers.

Abstract

Training neural text-to-speech (TTS) models for a new speaker typically requires several hours of high quality speech data. Prior works on voice cloning attempt to address this challenge by adapting pre-trained multi-speaker TTS models for a new voice, using a few minutes of speech data of the new speaker. However, publicly available large multi-speaker datasets are often noisy, thereby resulting in TTS models that are not suitable for use in products. We address this challenge by proposing transfer-learning guidelines for adapting high quality single-speaker TTS models for a new speaker, using only a few minutes of speech data. We conduct an extensive study using different amounts of data for a new speaker and evaluate the synthesized speech in terms of naturalness and voice/style similarity to the target speaker. We find that fine-tuning a single-speaker TTS model on just 30 minutes…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing