Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining
Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

TL;DR
This paper introduces a zero-shot multilingual text-to-speech system that leverages unsupervised text pretraining and cross-lingual transfer to synthesize speech in languages with only textual resources, expanding TTS accessibility.
Contribution
It presents a novel framework combining masked language model pretraining and supervised training with frozen language embeddings for zero-shot multilingual TTS.
Findings
Achieves less than 12% character error rate on unseen languages.
Enables TTS synthesis for low-resource languages using only text data.
Demonstrates high intelligibility in zero-shot multilingual speech synthesis.
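For readers unfamiliar with the metric, the character error rate (CER) cited above is conventionally the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The following is a minimal sketch of that standard definition only. This summary does not state which speech recognizer or text normalization the authors used, so the function names and the example strings are illustrative assumptions, not the paper's evaluation pipeline.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution and one deletion.
cer = character_error_rate("hello world", "hallo wrld")
```

In this example there are 2 edits over 11 reference characters, so the CER is about 18%. A reported CER below 12% therefore corresponds to roughly fewer than 12 character edits per 100 reference characters.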
Abstract
While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
