EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Haobin Tang; Xulong Zhang; Jianzong Wang; Ning Cheng; Jing Xiao

arXiv:2306.00648·cs.SD·June 2, 2023·1 cites

TL;DR

EmoMix is a diffusion-based emotional TTS system that enables the synthesis of speech with mixed emotions and controllable intensity, overcoming limitations of previous methods in emotion diversity and intensity regulation.

Contribution

Introduces EmoMix, a novel diffusion model-based TTS framework capable of generating mixed emotions and controlling emotional intensity in speech synthesis.

Findings

01 Effective in synthesizing speech with mixed emotions.

02 Demonstrates precise control over emotional intensity.

03 Outperforms existing methods in emotion diversity and intensity accuracy.

Abstract

There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on synthesizing a limited number of emotion types and achieve unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with a specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model that extracts emotion embeddings. Mixed emotion synthesis is achieved by combining the noises predicted by the diffusion model conditioned on different emotions within a single sampling process at run-time. We further control intensity by mixing Neutral with a specific primary emotion in varying degrees. Experimental results…
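
The noise-mixing idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names (`toy_noise_predictor`, `mix_noise`, `sample_mixed`), the linear noise schedule, the generic DDPM-style reverse step, and the weighted-average mixing rule. The toy predictor stands in for a trained denoiser conditioned on an SER emotion embedding. The sketch only shows the structure: at each reverse step the denoiser is queried once per emotion, the predicted noises are combined, and a single sampling trajectory is advanced.

```python
import numpy as np

def toy_noise_predictor(x_t, t, emotion_emb):
    """Hypothetical stand-in for a trained denoiser conditioned on an emotion embedding."""
    return 0.1 * x_t + 0.01 * t * emotion_emb.mean()

def mix_noise(eps_list, weights):
    """Weighted average of per-emotion noise predictions (assumed mixing rule)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return sum(w * e for w, e in zip(weights, eps_list))

def sample_mixed(predictor, emotion_embs, weights, shape, num_steps=50, seed=0):
    """Run one reverse-diffusion pass, mixing the predicted noises at every step."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.05, num_steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        # One denoiser call per emotion, all on the same current sample x.
        eps_list = [predictor(x, t, emb) for emb in emotion_embs]
        eps = mix_noise(eps_list, weights)
        # Generic DDPM posterior mean using the mixed noise estimate.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Intensity control as a special case: mix Neutral with one primary emotion.
neutral = np.zeros(8)   # placeholder embeddings, not real SER outputs
happy = np.ones(8)
intensity = 0.7
mel = sample_mixed(toy_noise_predictor, [neutral, happy],
                   [1.0 - intensity, intensity], shape=(4, 80))
```

Under these assumptions, an intensity of 0 reduces to purely Neutral conditioning and an intensity of 1 to the primary emotion alone, while a mixture of two non-neutral emotions uses the same call with different embeddings and weights.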

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition