Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning

Giuseppe Ruggiero; Enrico Zovato; Luigi Di Caro; Vincent Pollet

arXiv:2102.05630 · cs.SD · February 11, 2021 · 6 cites

Open Access

TL;DR

This paper presents a transfer learning-based multi-speaker text-to-speech synthesis approach that can generate speech resembling various target speakers, including unseen ones, without retraining the model for each new voice.

Contribution

The proposed method enables multi-speaker TTS with transfer learning, reducing data collection and training effort for new speakers compared to traditional single-speaker models.

Findings

01 Able to synthesize speech of unseen speakers

02 Reduces need for large datasets per speaker

03 Demonstrates effective transfer learning in TTS

Abstract

Deep learning models are becoming predominant in many fields of machine learning. Text-to-Speech (TTS), the process of synthesizing artificial speech from text, is no exception. For this task, a deep neural network is usually trained on a corpus of several hours of recorded speech from a single speaker. Producing the voice of a speaker other than the one learned is expensive and requires considerable effort, since a new dataset must be recorded and the model retrained. This is the main reason why TTS models are usually single-speaker. The proposed approach aims to overcome these limitations by building a system that can model a multi-speaker acoustic space. This allows the generation of speech audio similar to the voice of different target speakers, even if they were not observed during the training phase.

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing