Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Takaaki Saeki; Soumi Maiti; Xinjian Li; Shinji Watanabe; Shinnosuke Takamichi; Hiroshi Saruwatari

arXiv:2301.12596·eess.AS·May 30, 2023

Open Access · 1 Repo

TL;DR

This paper introduces a zero-shot multilingual text-to-speech system that leverages unsupervised text pretraining and cross-lingual transfer to synthesize speech in languages with only textual resources, expanding TTS accessibility.

Contribution

The paper presents a framework for zero-shot multilingual TTS that combines masked language model pretraining on multilingual text-only data with supervised training on paired data, keeping the language-aware embedding layer frozen during the supervised stage.

Findings

1. Achieves less than 12% character error rate on unseen languages.

2. Enables TTS synthesis for low-resource languages using only text data.

3. Demonstrates high intelligibility in zero-shot multilingual speech synthesis.

Abstract

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the…
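
The masked language model pretraining step mentioned in the abstract can be illustrated with a minimal sketch of the input corruption it relies on. This is not the authors' implementation: the token unit (characters), the `<mask>` symbol, the masking rate, and the function name `mask_tokens` are all assumptions made for illustration, and the later supervised stage with the frozen embedding layer is not shown.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace each token with MASK with probability mask_prob.

    Returns (corrupted, targets), where targets[i] holds the original
    token at masked positions and None everywhere else. A model would be
    trained to predict the non-None targets from the corrupted sequence.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets


# Assumed example: character-level tokens from a text-only corpus.
tokens = list("hello world")
corrupted, targets = mask_tokens(tokens, mask_prob=0.3, seed=0)
```

Because this objective needs only text, it can be applied to languages that have no paired audio, which is the property the abstract builds on.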

Code & Models

Repositories

takaaki-saeki/zm-text-tts
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques