EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis
Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

TL;DR
EmoMix is a diffusion-based emotional TTS system that enables the synthesis of speech with mixed emotions and controllable intensity, overcoming limitations of previous methods in emotion diversity and intensity regulation.
Contribution
Introduces EmoMix, a novel diffusion model-based TTS framework capable of generating mixed emotions and controlling emotional intensity in speech synthesis.
Findings
Effective in synthesizing speech with mixed emotions.
Demonstrates precise control over emotional intensity.
Outperforms existing methods in emotion diversity and intensity accuracy.
Abstract
There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on synthesizing a limited number of emotion types and achieve unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with a specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embeddings. Mixed emotion synthesis is achieved by combining the noises predicted by the diffusion model conditioned on different emotions, within a single sampling process at run-time. To control intensity, we further mix Neutral with a specific primary emotion in varying degrees. Experimental results…
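The noise-combination idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the abstract does not state the exact combination rule, so a normalized weighted sum is used as one plausible reading. `toy_noise_predictor` is a hypothetical stand-in for the trained conditional denoiser, and the emotion embeddings are placeholder vectors rather than outputs of an SER model.

```python
import numpy as np

def toy_noise_predictor(x_t, t, emotion_embedding):
    """Hypothetical stand-in for a trained denoiser eps(x_t, t, e).

    A real model would be a neural network; this toy version is only
    deterministic and emotion-dependent so the mixing step can be shown.
    """
    return 0.1 * x_t + emotion_embedding.mean() * np.ones_like(x_t)

def mixed_noise(x_t, t, embeddings, weights, predictor=toy_noise_predictor):
    """Combine per-emotion noise predictions at one sampling step.

    Assumption: a normalized weighted sum. The abstract only says the
    predicted noises are "combined".
    """
    weights = np.asarray(weights, dtype=float)
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per emotion embedding")
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()
    predictions = [predictor(x_t, t, e) for e in embeddings]
    return sum(w * p for w, p in zip(weights, predictions))

def intensity_noise(x_t, t, neutral, primary, intensity,
                    predictor=toy_noise_predictor):
    """Intensity control as a Neutral/primary mix; intensity lies in [0, 1]."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return mixed_noise(x_t, t, [neutral, primary],
                       [1.0 - intensity, intensity], predictor)

# Placeholder inputs: a tiny "mel-spectrogram" and two fake embeddings.
x_t = np.ones((2, 3))
neutral = np.zeros(4)
happy = np.full(4, 2.0)
half_mix = mixed_noise(x_t, 10, [neutral, happy], [1.0, 1.0])
mild_happy = intensity_noise(x_t, 10, neutral, happy, intensity=0.25)
```

In a full sampler, the combined noise would replace the single-condition prediction at each reverse diffusion step, so mixing needs only one sampling pass, as the abstract describes.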
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition
