ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech

Xiaoran Fan; Chao Pang; Tian Yuan; He Bai; Renjie Zheng; Pengfei Zhu; Shuohuan Wang; Junkun Chen; Zeyu Chen; Liang Huang; Yu Sun; Hua Wu

arXiv:2211.03545·eess.AS·December 6, 2022

Open Access · 2 Repos

TL;DR

This paper introduces ERNIE-SAT, a joint speech-text pretraining framework that enhances cross-lingual multi-speaker speech synthesis, voice cloning, and editing without finetuning, outperforming existing speaker-embedding methods.

Contribution

It proposes a novel speech-text joint pretraining approach for cross-lingual multi-speaker TTS, voice cloning, and editing, with an end-to-end training and inference process.

Findings

01 Significant improvements over speaker-embedding methods.

02 Effective cross-lingual multi-speaker voice cloning.

03 Successful speech editing in multiple languages.

Abstract

Speech representation learning has improved both speech understanding and speech synthesis tasks for a single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method to cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework in which, given a speech example and its transcription, we randomly mask the spectrogram and the phonemes. By learning to reconstruct the masked parts of the input in different languages, our model shows great improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both training and inference, without any finetuning effort. In cross-lingual multi-speaker voice cloning and cross-lingual…
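
The masking step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the mask token, the single contiguous frame span, the per-phoneme masking probability, and the L1 reconstruction loss are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical placeholder symbol, not from the paper


def mask_phonemes(phonemes, mask_prob, rng):
    """Replace each phoneme with MASK_TOKEN independently with probability mask_prob.

    Returns the masked sequence and the indices that were masked.
    """
    masked, masked_idx = [], []
    for i, p in enumerate(phonemes):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            masked_idx.append(i)
        else:
            masked.append(p)
    return masked, masked_idx


def mask_spectrogram(frames, span_len, rng):
    """Zero out one randomly placed contiguous span of spectrogram frames.

    `frames` is a list of per-frame feature lists. Returns the masked copy
    and the half-open (start, end) span that was masked.
    """
    n = len(frames)
    span_len = min(span_len, n)
    start = rng.randrange(n - span_len + 1)
    end = start + span_len
    masked = [
        [0.0] * len(f) if start <= t < end else list(f)
        for t, f in enumerate(frames)
    ]
    return masked, (start, end)


def reconstruction_loss(predicted, target, span):
    """Mean absolute error computed over the masked frames only (assumed loss)."""
    start, end = span
    total, count = 0.0, 0
    for t in range(start, end):
        for p, y in zip(predicted[t], target[t]):
            total += abs(p - y)
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [[float(t)] * 4 for t in range(10)]  # toy 10-frame, 4-bin spectrogram
    phonemes = ["HH", "AH", "L", "OW"]
    masked_frames, span = mask_spectrogram(frames, span_len=3, rng=rng)
    masked_phonemes, idx = mask_phonemes(phonemes, mask_prob=0.3, rng=rng)
    print(span, masked_phonemes, idx)
```

In a real pretraining setup, a model would receive the masked frames and masked phonemes together and be trained to predict the original values at the masked positions; the toy loss above only scores frames inside the masked span.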

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques