RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Haoxiang Shi; Jianzong Wang; Xulong Zhang; Ning Cheng; Jun Yu; Jing Xiao

arXiv:2405.17028·cs.SD·May 28, 2024

Open Access

TL;DR

This paper introduces RSET, a novel emotion transfer TTS model that enables fine-grained emotion intensity control by remapping intra-class intensity and decoupling speaker and emotion information, resulting in more expressive speech synthesis.

Contribution

The paper proposes a remapping-based sorting method combined with Mutual Information to improve emotion intensity control and decouple speaker and emotion features in TTS.

Findings

01. Achieves fine-grained emotion intensity control.

02. Preserves speaker identity during emotion transfer.

03. Produces more expressive speech with perceptible emotion differences.

Abstract

Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage issues. In this paper, we propose an emotion transfer TTS model, which defines a remapping-based sorting method to model intra-class relative intensity information, combined with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speeches with perceptible intensity differences. Experiments show that our model…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing