Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning
Giuseppe Ruggiero, Enrico Zovato, Luigi Di Caro, Vincent Pollet

TL;DR
This paper presents a transfer learning-based multi-speaker text-to-speech synthesis approach that can generate speech resembling various target speakers, including unseen ones, without retraining the model for each new voice.
Contribution
The proposed method enables multi-speaker TTS with transfer learning, reducing data collection and training effort for new speakers compared to traditional single-speaker models.
Findings
Synthesizes speech for speakers not seen during training
Reduces the need for a large dataset per speaker
Demonstrates effective transfer learning in TTS
Abstract
Deep learning models are becoming predominant in many fields of machine learning, and Text-to-Speech (TTS), the process of synthesizing artificial speech from text, is no exception. A TTS deep neural network is usually trained on a corpus of several hours of recorded speech from a single speaker. Producing the voice of a speaker other than the one learned is expensive and laborious, since it requires recording a new dataset and retraining the model. This is the main reason why TTS models are usually single-speaker. The proposed approach aims to overcome these limitations by building a system that can model a multi-speaker acoustic space. This allows the generation of speech audio similar to the voices of different target speakers, even if they were not observed during the training phase.
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
