GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech

Yahuan Cong; Haoyu Zhang; Haopeng Lin; Shichao Liu; Chunfeng Wang; Yi Ren; Xiang Yin; Zejun Ma

arXiv:2306.15304·eess.AS·June 28, 2023

TL;DR

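GenerTTS is a cross-lingual TTS system that disentangles pronunciation from timbre and style, so that a reference timbre or style can be used in a target language it was never trained in. A HuBERT-based information bottleneck separates timbre from pronunciation/style, and mutual information minimization between style and language removes language-specific information from the style embedding.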

Contribution

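The paper proposes two components for cross-lingual TTS: a HuBERT-based information bottleneck that disentangles timbre from pronunciation/style, and a mutual information minimization objective between style and language that discards language-specific information from the style embedding. Together they improve style similarity and pronunciation accuracy over baseline systems.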

Findings

01. Outperforms baseline in style similarity

02. Achieves better pronunciation accuracy

03. Enables cross-lingual timbre and style generalization

Abstract

Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference timbre or style that is never trained in the target language. It encounters the following challenges: 1) timbre and pronunciation are correlated since multilingual speech of a specific speaker is usually hard to obtain; 2) style and pronunciation are mixed because the speech style contains language-agnostic and language-specific parts. To address these challenges, we propose GenerTTS, which mainly includes the following works: 1) we elaborately design a HuBERT-based information bottleneck to disentangle timbre and pronunciation/style; 2) we minimize the mutual information between style and language to discard the language-specific information in the style embedding. The experiments indicate that GenerTTS outperforms baseline systems in terms of style similarity and…
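The second step in the abstract, minimizing the mutual information between style and language, can be illustrated with a small numerical sketch. The abstract does not say which estimator the authors use, so the code below assumes a CLUB-style sampled upper bound, computed from the outputs of a hypothetical language classifier applied to style embeddings. The function name `club_upper_bound` and the toy inputs are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis of a 2-D array."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def club_upper_bound(logits, lang_ids):
    """CLUB-style sampled upper bound on I(style; language).

    Assumption: `logits[i]` are the scores of a hypothetical classifier
    q(language | style_i) for the i-th style embedding, and `lang_ids[i]`
    is the true language index of that utterance.
    """
    log_q = log_softmax(logits)                # shape (N, num_languages)
    n = log_q.shape[0]
    # Matched pairs: mean of log q(y_i | x_i).
    positive = log_q[np.arange(n), lang_ids].mean()
    # All pairs: mean of log q(y_j | x_i) over every i and j.
    negative = log_q[:, lang_ids].mean()
    return positive - negative


# Toy batch of four utterances in two languages.
lang_ids = np.array([0, 1, 0, 1])

# Classifier that cannot tell the languages apart from the style embedding.
uninformative = np.zeros((4, 2))
# Classifier that reads the language off the style embedding confidently.
informative = 5.0 * np.eye(2)[lang_ids]

print("bound, language-agnostic style:", club_upper_bound(uninformative, lang_ids))
print("bound, language-leaking style: ", club_upper_bound(informative, lang_ids))
```

In a training loop of this kind, the classifier would typically be fit to predict the language from the style embedding, while the style encoder receives the bound as an extra loss term and is pushed toward embeddings for which the bound is small.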

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis