Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

Xinfa Zhu; Yi Lei; Kun Song; Yongmao Zhang; Tao Li; Lei Xie

arXiv:2211.10568·eess.AS·March 15, 2023

TL;DR

This paper presents a multi-factor disentanglement framework for expressive speech synthesis that transfers speaking style and emotion from reference speech to target speakers using a two-stage neural approach.

Contribution

It introduces a multi-factor decoupling method with multi-label binary vectors and mutual information minimization, along with semi-supervised training and an attention-based reference selection for improved expressive speech synthesis.

Findings

01. Effective disentanglement of speaker, style, and emotion factors.

02. High-quality style and emotion transfer in non-parallel data.

03. Robust synthesis across multiple speakers and expressions.

Abstract

This paper aims to synthesize the target speaker's speech with a desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style, and emotion) decoupling problem, we adopt multi-label binary vectors (MBV) and mutual information (MI) minimization to, respectively, discretize the extracted embeddings and disentangle these highly entangled factors in both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To…
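The abstract says embeddings are discretized into multi-label binary vectors but does not give the formulation. As a rough illustration only, the sketch below assumes one common way to obtain a binary code: apply a sigmoid to each real-valued dimension and threshold the result. The function name `binarize_embedding`, the threshold of 0.5, and the example values are hypothetical and are not taken from the paper.

```python
import math


def binarize_embedding(logits, threshold=0.5):
    """Map real-valued logits to a 0/1 vector (illustrative assumption,
    not the paper's stated MBV procedure)."""
    # Squash each dimension into (0, 1) with a sigmoid.
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Each dimension becomes an independent binary label.
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical 4-dimensional embedding, before discretization.
example_logits = [2.0, -1.5, 0.3, -0.2]
print(binarize_embedding(example_logits))  # prints [1, 0, 1, 0]
```

Hard thresholding like this is not differentiable, so a trainable model would need some gradient workaround; how the paper handles that is not stated in the text above.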

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing