Meta Learning Text-to-Speech Synthesis in over 7000 Languages
Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu

TL;DR
This paper introduces a universal text-to-speech system capable of synthesizing speech in over 7000 languages, including many with no available data, using multilingual pretraining and meta learning.
Contribution
It presents a novel approach combining multilingual pretraining and meta learning to enable zero-shot TTS in extremely low-resource languages.
Findings
Effective zero-shot synthesis demonstrated across diverse languages
System outperforms baseline models in objective and human evaluations
Public release of code and models to support linguistic diversity
Abstract
In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Speech Recognition and Synthesis
