Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario

Zexin Cai; Yaogen Yang; Ming Li

arXiv:2005.10441 · eess.AS · May 22, 2020 · 5 cites

TL;DR

This paper extends Tacotron2 to cross-lingual multispeaker TTS when data for each language are limited, enabling bilingual English-Mandarin synthesis, code-switching, and synthesis of monolingual speakers' voices in the other language.

Contribution

It introduces a bilingual multispeaker TTS model that works with limited data: both languages share the same phonemic input representations, while language and speaker identity are controlled independently through language tokens and speaker embeddings.

Findings

01. High-fidelity speech for monolingual speakers in non-native languages
02. Robust code-switching synthesis with bilingual training data
03. Monolingual speakers can speak fluently in non-native languages

Abstract

Modeling voices for multiple speakers and multiple languages in one text-to-speech system has long been a challenge. This paper presents an extension of Tacotron2 to achieve bilingual multispeaker speech synthesis when there are limited data for each language. We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers. The two languages share the same phonemic representations for input, while the language attribute and the speaker identity are independently controlled by language tokens and speaker embeddings, respectively. In addition, we investigate the model's performance on cross-lingual synthesis, with and without a bilingual dataset during training. With the bilingual dataset, not only can the model generate high-fidelity speech for all speakers concerning the language they speak, but also can generate…
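The input conditioning described in the abstract (one shared phoneme inventory, language tokens, and a separate speaker embedding) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy phoneme set, the `<en>`/`<zh>` token names, the choice to place a language token before each segment, and the helper names are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical toy inventory shared by both languages (not the paper's phoneme set).
PHONEMES = ["a", "i", "n", "h", "ao", "l", "ow"]
# Hypothetical language tokens; the paper's actual token format is not given here.
LANG_TOKENS = ["<en>", "<zh>"]
# One vocabulary for language tokens and phonemes from both languages.
VOCAB = {sym: idx for idx, sym in enumerate(LANG_TOKENS + PHONEMES)}


def encode_utterance(segments):
    """Map (language, phonemes) segments to symbol ids.

    Assumption: a language token precedes each segment, so a code-switched
    utterance is simply a sequence with more than one segment.
    """
    ids = []
    for lang, phones in segments:
        ids.append(VOCAB[f"<{lang}>"])
        ids.extend(VOCAB[p] for p in phones)
    return ids


def make_speaker_table(speakers, dim=4, seed=0):
    """Stand-in for a learned speaker-embedding table (random vectors here)."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for s in speakers}


def build_model_input(segments, speaker, speaker_table):
    """Keep text/language ids and speaker identity as separate inputs."""
    return {
        "symbol_ids": encode_utterance(segments),
        "speaker_embedding": speaker_table[speaker],
    }


if __name__ == "__main__":
    table = make_speaker_table(["spk_en", "spk_zh"])
    # A toy code-switched utterance: a Mandarin segment followed by an English one.
    segments = [("zh", ["n", "i", "h", "ao"]), ("en", ["h", "l", "ow"])]
    print(build_model_input(segments, "spk_en", table))
```

Because the symbol sequence and the speaker vector are built independently, the same code-switched text can be paired with either speaker, which is the kind of independent control the abstract describes. How the real model consumes these inputs inside Tacotron2 is not specified in this summary.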

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling