Emotion Intensity and its Control for Emotional Voice Conversion
Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller, Haizhou Li

TL;DR
This paper introduces a method for emotional voice conversion that explicitly models and controls emotion intensity, enabling more expressive and nuanced speech synthesis while preserving content and speaker identity.
Contribution
It proposes a novel approach to disentangle speaker style from content and encode emotion intensity in a continuous space, improving emotional expressiveness in voice conversion.
Findings
Effective control of emotion intensity demonstrated
Improved emotional expressiveness validated through evaluations
Disentanglement of style and content enhances conversion quality
Abstract
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories, overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the…
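The abstract names relative attributes as the way to represent emotion intensity but gives no implementation details here. As a rough illustration of the general idea only, the sketch below learns a ranking direction that scores emotional samples above neutral ones and then rescales those scores to [0, 1] as an intensity value. Everything in it is an assumption made for illustration: the synthetic 4-dimensional features, the hinge-loss subgradient training, the hyperparameters, and the function names `fit_ranking_direction` and `intensity_scores` are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_ranking_direction(neutral, emotional, lr=0.1, epochs=200, margin=1.0):
    """Learn w so that w @ e exceeds w @ n by a margin for (emotional, neutral) pairs.

    Hypothetical helper: a plain hinge-loss ranking objective, not the paper's method.
    """
    dim = neutral.shape[1]
    # All pairwise differences (emotional minus neutral), shape (n_pairs, dim).
    diffs = (emotional[:, None, :] - neutral[None, :, :]).reshape(-1, dim)
    w = np.zeros(dim)
    for _ in range(epochs):
        # Pairs whose ranking margin is not yet satisfied.
        violated = diffs @ w < margin
        # Subgradient of the mean hinge loss plus a small L2 penalty.
        grad = 0.01 * w - diffs[violated].sum(axis=0) / len(diffs)
        w -= lr * grad
    return w

def intensity_scores(features, w):
    """Project features onto w and min-max rescale to [0, 1]."""
    raw = features @ w
    span = raw.max() - raw.min()
    return (raw - raw.min()) / span if span > 0 else np.zeros_like(raw)

# Synthetic stand-ins for acoustic features (assumption: real features would
# come from speech, which this sketch does not model).
rng = np.random.default_rng(0)
neutral = rng.normal(0.0, 0.5, size=(20, 4))
emotional = rng.normal(0.0, 0.5, size=(20, 4)) + np.array([2.0, 0.0, 0.0, 0.0])

w = fit_ranking_direction(neutral, emotional)
scores = intensity_scores(np.vstack([neutral, emotional]), w)
print("mean neutral intensity:  ", scores[:20].mean())
print("mean emotional intensity:", scores[20:].mean())
```

In a setup like this, the rescaled score could serve as a continuous intensity label attached to each training utterance; how the paper actually derives and uses its intensity values is not stated in the truncated abstract above.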
