One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech

Tomáš Nekvinda; Ondřej Dušek

arXiv:2008.00768 · eess.AS · August 4, 2020

TL;DR

This paper presents a multilingual text-to-speech model that uses the meta-learning concept of contextual parameter generation to share information across languages, producing natural-sounding speech with less training data than previous approaches and improved code-switching.

Contribution

The authors introduce a meta-learning-based TTS model that improves multilingual speech synthesis and voice cloning while using less training data than previous approaches and improving cross-lingual performance.

Findings

01 Effective multilingual sharing across languages

02 Superior naturalness and accuracy in code-switching speech

03 Robust performance with limited training data

Abstract

We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches. Our model is based on Tacotron 2 with a fully convolutional input text encoder whose weights are predicted by a separate parameter generator network. To boost voice cloning, the model uses an adversarial speaker classifier with a gradient reversal layer that removes speaker-specific information from the encoder. We arranged two experiments to compare our model with baselines using various levels of cross-lingual parameter sharing, in order to evaluate: (1) stability and performance when training on low amounts of data, (2) pronunciation accuracy and voice quality of code-switching synthesis. For training, we used the CSS10 dataset and our…
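
The two mechanisms named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes, purely for illustration, that the parameter generator is a single linear projection of a learned language embedding and that the encoder is one 1D convolution. All names (`ParameterGenerator`, `conv1d_same`, `GradientReversal`) and all sizes are hypothetical. The gradient reversal layer is shown with a hand-written backward pass: identity in the forward direction, negated and scaled gradient in the backward direction.

```python
import numpy as np

rng = np.random.default_rng(0)


class ParameterGenerator:
    """Hypothetical generator: language embedding -> conv1d kernel weights."""

    def __init__(self, n_languages, lang_dim, in_ch, out_ch, kernel):
        self.shape = (out_ch, in_ch, kernel)
        n_params = out_ch * in_ch * kernel
        # One learned embedding per language (randomly initialised here).
        self.lang_emb = rng.normal(0.0, 0.1, size=(n_languages, lang_dim))
        # Assumed form of the generator: a single linear projection.
        self.proj_w = rng.normal(0.0, 0.1, size=(lang_dim, n_params))
        self.proj_b = np.zeros(n_params)

    def weights_for(self, lang_id):
        flat = self.lang_emb[lang_id] @ self.proj_w + self.proj_b
        return flat.reshape(self.shape)


def conv1d_same(x, w):
    """Cross-correlate x (in_ch, T) with w (out_ch, in_ch, k), k odd, zero-padded."""
    out_ch, _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    steps = x.shape[1]
    out = np.zeros((out_ch, steps))
    for t in range(steps):
        window = xp[:, t:t + k]  # shape (in_ch, k)
        out[:, t] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return out


class GradientReversal:
    """Identity in the forward pass; flips and scales the gradient backward."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, x):
        return x

    def backward(self, grad_out):
        return -self.scale * grad_out


# The same input text features are encoded with language-specific weights.
gen = ParameterGenerator(n_languages=3, lang_dim=4, in_ch=8, out_ch=8, kernel=3)
x = rng.normal(size=(8, 10))
enc_lang0 = conv1d_same(x, gen.weights_for(0))
enc_lang1 = conv1d_same(x, gen.weights_for(1))

# Gradients from a speaker classifier would pass through this layer reversed,
# pushing the encoder away from features that identify the speaker.
grl = GradientReversal(scale=0.5)
reversed_grad = grl.backward(np.ones(3))
```

Because only the small language embedding differs between languages, the generator's projection is shared by all of them; this is one way cross-lingual parameter sharing can arise under the assumptions above.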

Code & Models

Repositories

Tomiinek/Multilingual_Text_to_Speech
PyTorch · Official

Taxonomy

Methods: Sigmoid Activation · Long Short-Term Memory · Highway Layer · Mixture of Logistic Distributions · Residual Connection · Zoneout · Batch Normalization · Bidirectional LSTM · Location Sensitive Attention · Dilated Causal Convolution