Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

TL;DR
This paper introduces a multilingual speech synthesis model capable of high-quality, cross-language voice transfer without bilingual data, using phonemic inputs and adversarial training to disentangle speaker identity from language content.
Contribution
The novel model enables cross-language voice cloning and fluent speech synthesis across distant languages without bilingual training data, advancing multilingual TTS technology.
Findings
Effective voice transfer across languages, including distant ones like English and Mandarin.
High-quality speech synthesis in multiple languages with consistent speaker identity.
Model can generate speech in native or foreign accents for multiple speakers.
Abstract
We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related languages, e.g. English and Mandarin. Critical to achieving this result are: 1. using a phonemic input representation to encourage sharing of model capacity across languages, and 2. incorporating an adversarial loss term to encourage the model to disentangle its representation of speaker identity (which is perfectly correlated with language in the training data) from the speech content. Further scaling up the model by training on multiple speakers of each language, and incorporating an…
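The adversarial loss term mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the term is implemented, so this assumes one common approach: a speaker classifier reads the text encoding through a gradient reversal step, which pushes the encoder to drop speaker information. The function names, the toy weights, and the `scale` value are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_classifier_loss_and_grad(encoding, weights, speaker_id):
    """Cross-entropy of a linear speaker classifier and d(loss)/d(encoding).

    Hypothetical helper: `weights` has one row per speaker.
    """
    logits = [sum(w * x for w, x in zip(row, encoding)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[speaker_id])
    # d(loss)/d(logit_k) = p_k - 1[k == speaker_id]
    dlogits = [p - (1.0 if k == speaker_id else 0.0)
               for k, p in enumerate(probs)]
    # Chain rule through the linear layer.
    grad = [sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(encoding))]
    return loss, grad


def reverse_gradient(grad, scale=1.0):
    """Backward pass of a gradient reversal step: flip the sign and scale.

    The forward pass is the identity, so only the gradient changes.
    """
    return [-scale * g for g in grad]


# Toy values (assumptions): a 3-dim text encoding and 2 speakers.
encoding = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.5],
           [-0.5, 1.0, 0.0]]

loss, grad = speaker_classifier_loss_and_grad(encoding, weights, speaker_id=0)

# The classifier itself would be updated with `grad`-style gradients to
# predict the speaker; the encoder receives the reversed gradient instead.
encoder_grad = reverse_gradient(grad, scale=0.5)

# A gradient-descent step on the encoding using the reversed gradient
# makes the speaker harder to predict from the encoding.
lr = 0.1
new_encoding = [x - lr * g for x, g in zip(encoding, encoder_grad)]
new_loss, _ = speaker_classifier_loss_and_grad(new_encoding, weights, 0)
```

In this sketch the classifier and the encoder pull in opposite directions: the classifier minimizes the speaker cross-entropy while the encoder, seeing the sign-flipped gradient, maximizes it. In a full model this term would be added to the synthesis loss rather than used alone.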
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
Methods: Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Residual Connection · Convolution · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU · Highway Network
