Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario
Zexin Cai, Yaogen Yang, Ming Li

TL;DR
This paper extends Tacotron2 to cross-lingual multispeaker TTS when data for each language are limited, supporting English and Mandarin synthesis, code-switching, and cross-lingual synthesis for monolingual speakers.
Contribution
It introduces a bilingual multispeaker TTS model that works with limited data: both languages share one phonemic input representation, while language and speaker identity are controlled independently through language tokens and speaker embeddings.
Findings
High-fidelity, fluent speech for monolingual speakers in their non-native language (English or Mandarin)
Robust code-switching synthesis when a bilingual dataset is included in training
Abstract
Modeling voices for multiple speakers and multiple languages in one text-to-speech system has long been a challenge. This paper presents an extension of Tacotron2 that achieves bilingual multispeaker speech synthesis when data for each language are limited. We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers. The two languages share the same phonemic representations for input, while the language attribute and the speaker identity are independently controlled by language tokens and speaker embeddings, respectively. In addition, we investigate the model's performance on cross-lingual synthesis with and without a bilingual dataset during training. With the bilingual dataset, not only can the model generate high-fidelity speech for all speakers in the language they speak, but it can also generate…
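The conditioning scheme described in the abstract (shared phonemic inputs, language tokens, speaker embeddings) can be illustrated with a small sketch. This is not the authors' implementation: the phoneme inventory, the `<en>`/`<zh>` token names, placing a language token at the start of each segment, concatenating the speaker vector to every input step, and the random embedding tables are all assumptions made for illustration.

```python
import random

# Hypothetical shared phoneme inventory (assumption; the paper's symbol set is not given here).
PHONEMES = ["a", "i", "n", "h", "ao", "l", "ou", "e"]
# Hypothetical language tokens, one per language.
LANG_TOKENS = ["<en>", "<zh>"]
SYMBOLS = LANG_TOKENS + PHONEMES
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}


def encode_segments(segments):
    """Turn [(lang, [phonemes...]), ...] into one ID sequence.

    Each segment is prefixed with its language token (an assumed placement),
    so a code-switched utterance is simply several segments in a row.
    """
    ids = []
    for lang, phonemes in segments:
        ids.append(SYMBOL_TO_ID[f"<{lang}>"])
        ids.extend(SYMBOL_TO_ID[p] for p in phonemes)
    return ids


def make_table(rows, dim, seed):
    """Random stand-in for a learned embedding table."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def condition_on_speaker(ids, symbol_table, speaker_vec):
    """Look up each symbol embedding and append the same speaker vector to every step."""
    return [symbol_table[i] + speaker_vec for i in ids]


symbol_table = make_table(len(SYMBOLS), dim=4, seed=0)
speaker_table = make_table(2, dim=3, seed=1)

# A code-switched utterance: a Mandarin segment followed by an English segment.
utterance = [("zh", ["n", "i", "h", "ao"]), ("en", ["h", "e", "l", "ou"])]
ids = encode_segments(utterance)

# Same text, two different speakers: only the speaker part of each frame changes.
frames_a = condition_on_speaker(ids, symbol_table, speaker_table[0])
frames_b = condition_on_speaker(ids, symbol_table, speaker_table[1])
```

Because the phoneme `h` maps to the same ID in both segments, the two languages share input symbols, and swapping `speaker_table[0]` for `speaker_table[1]` changes the voice condition without touching the text or language tokens.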
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
