EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance
Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

TL;DR
EmoDiff is a diffusion-based TTS model that enables precise control of emotional intensity in speech through a novel soft-label guidance technique, maintaining high quality and diversity.
Contribution
It introduces a soft-label guidance method for emotion intensity control in diffusion-based TTS, improving flexibility and quality over existing approaches.
Findings
Accurately controls emotion intensity in speech
Maintains high speech quality during control
Generates diverse emotional speech samples
Abstract
Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the value of the specified emotion and \textit{Neutral} is set to and respectively. The here represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
