GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech
Yahuan Cong, Haoyu Zhang, Haopeng Lin, Shichao Liu, Chunfeng Wang, Yi Ren, Xiang Yin, Zejun Ma

TL;DR
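GenerTTS is a cross-lingual TTS system that synthesizes speech in a target language with a reference timbre or style never seen in that language during training. It uses a HuBERT-based information bottleneck to disentangle timbre from pronunciation and style, and it minimizes the mutual information between style and language to discard language-specific information from the style embedding.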
Contribution
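The paper makes two contributions: (1) a HuBERT-based information bottleneck that disentangles timbre from pronunciation and style, and (2) mutual information minimization between style and language, which removes language-specific information from the style embedding. Together these improve style similarity and pronunciation accuracy over baseline systems.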
Findings
Outperforms baseline systems in style similarity
Achieves better pronunciation accuracy than baseline systems
Enables cross-lingual timbre and style generalization
Abstract
Cross-lingual timbre and style generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference timbre or style that is never trained in the target language. It encounters the following challenges: 1) timbre and pronunciation are correlated since multilingual speech of a specific speaker is usually hard to obtain; 2) style and pronunciation are mixed because the speech style contains language-agnostic and language-specific parts. To address these challenges, we propose GenerTTS, which mainly includes the following works: 1) we elaborately design a HuBERT-based information bottleneck to disentangle timbre and pronunciation/style; 2) we minimize the mutual information between style and language to discard the language-specific information in the style embedding. The experiments indicate that GenerTTS outperforms baseline systems in terms of style similarity and…
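The second step in the abstract, minimizing the mutual information between style and language, can be illustrated with a small sketch. The abstract does not say which estimator GenerTTS uses, so the code below assumes a CLUB-style sampled upper bound, in which a language classifier q(l | s) scores each style embedding s against every language label l in the batch. The names `club_upper_bound`, `logits`, and `lang_ids` are hypothetical and do not come from the paper.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis of an (N, L) array."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def club_upper_bound(logits, lang_ids):
    """CLUB-style sampled upper bound on I(style; language).

    logits   : (N, L) classifier scores for q(language | style_i).
               The classifier itself is an assumption of this sketch.
    lang_ids : (N,) integer language label of each utterance.
    """
    log_q = log_softmax(logits)
    n = len(lang_ids)
    # Matched pairs: log q(l_i | s_i), averaged over the batch.
    positive = log_q[np.arange(n), lang_ids].mean()
    # All pairs: entry (i, j) of log_q[:, lang_ids] is log q(l_j | s_i).
    negative = log_q[:, lang_ids].mean()
    return positive - negative


# Toy batch: two languages, four utterances.
lang_ids = np.array([0, 1, 0, 1])

# A classifier that cannot tell the languages apart gives a bound of zero.
uninformative = np.zeros((4, 2))

# A classifier that reads the language off the style embedding gives a
# large bound, which a training loss would then push down.
informative = 5.0 * np.eye(2)[lang_ids]

print(club_upper_bound(uninformative, lang_ids))
print(club_upper_bound(informative, lang_ids))
```

In a training loop under these assumptions, the style encoder would be updated to lower this bound while the classifier is updated separately to fit q(l | s), so that the style embedding carries less language-specific information.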
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
