Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Ye Jia; Yu Zhang; Ron J. Weiss; Quan Wang; Jonathan Shen; Fei Ren; Zhifeng Chen; Patrick Nguyen; Ruoming Pang; Ignacio Lopez Moreno; Yonghui Wu

arXiv:1806.04558·cs.CL·January 4, 2019·434 cites

TL;DR

This paper presents a neural TTS system that uses transfer learning from speaker verification to synthesize natural speech in many voices, including those of speakers unseen during training. The speaker encoder is trained separately on a large dataset of noisy, untranscribed speech.

Contribution

The paper introduces a transfer learning approach in which a network trained for speaker verification supplies the speaker embedding that conditions a multispeaker TTS model, enabling synthesis of voices not seen during training.

Findings

01. The system can synthesize speech in unseen speakers' voices.

02. Training the speaker encoder on a large diverse dataset improves generalization.

03. Randomly sampled embeddings produce high-quality novel speaker voices.

Abstract

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability…
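The data flow between the three components can be sketched as follows. This is a minimal illustration only, not the authors' implementation: every function below is a hypothetical stand-in that produces arrays of plausible shape, and the constants `EMBED_DIM`, `N_MELS`, and `HOP`, along with the unit-length normalization, are assumptions that do not come from the abstract.

```python
import numpy as np

EMBED_DIM = 256  # assumed embedding size (hypothetical)
N_MELS = 80      # assumed number of mel channels (hypothetical)
HOP = 200        # assumed waveform samples per spectrogram frame (hypothetical)


def speaker_encoder(reference_wav: np.ndarray) -> np.ndarray:
    """Stand-in for component (1): reference audio -> fixed-dimensional embedding.

    A real encoder is a trained network; here we only average the absolute
    amplitude over EMBED_DIM chunks so the output has a fixed size regardless
    of the reference length (assumes at least EMBED_DIM samples).
    """
    chunks = np.array_split(reference_wav, EMBED_DIM)
    e = np.array([np.abs(c).mean() for c in chunks])
    return e / (np.linalg.norm(e) + 1e-12)  # unit length is an assumption


def synthesizer(text: str, embedding: np.ndarray) -> np.ndarray:
    """Stand-in for component (2): text + speaker embedding -> mel spectrogram."""
    n_frames = 5 * len(text)  # toy duration model: 5 frames per character
    # Broadcasting the embedding to every frame mimics speaker conditioning.
    return np.zeros((n_frames, N_MELS)) + embedding[:N_MELS]


def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stand-in for component (3): mel spectrogram -> time-domain samples."""
    return np.repeat(mel.mean(axis=1), HOP)


def clone_voice(text: str, reference_wav: np.ndarray) -> np.ndarray:
    """Chain the three independently built stages."""
    embedding = speaker_encoder(reference_wav)
    mel = synthesizer(text, embedding)
    return vocoder(mel)


if __name__ == "__main__":
    reference = np.sin(np.linspace(0.0, 100.0, 16000))  # placeholder "speech"
    waveform = clone_voice("hello", reference)
    print(waveform.shape)
```

The point of the sketch is the interface: the only thing the synthesizer receives about the target speaker is the embedding vector, which is why the encoder can be trained separately on untranscribed speech.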

Code & Models

Videos

Google's AI Clones Your Voice After Listening for 5 Seconds! 🤐 · YouTube

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing