EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

Navin Raj Prabhu; Bunlong Lay; Simon Welker; Nale Lehmann-Willenbrock; and Timo Gerkmann

arXiv:2309.07828·eess.AS·January 9, 2024

Open Access

TL;DR

This paper introduces EmoConv-Diff, a diffusion-based model for speech emotion conversion that works on in-the-wild data without parallel samples, using continuous arousal for emotion representation and control.

Contribution

It presents a novel diffusion model for non-parallel speech emotion conversion using continuous arousal, improving emotion intensity control and performance on in-the-wild data.

Findings

01 Effective emotion conversion with continuous arousal representation.

02 Improved performance at extreme arousal values.

03 Capable of in-the-wild speech emotion synthesis.

Abstract

Speech emotion conversion is the task of converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity. While most existing works in speech emotion conversion rely on acted-out datasets and parallel data samples, in this work we specifically focus on more challenging in-the-wild scenarios and do not rely on parallel data. To this end, we propose a diffusion-based generative model for speech emotion conversion, the EmoConv-Diff, that is trained to reconstruct an input utterance while also conditioning on its emotion. Subsequently, at inference, a target emotion embedding is employed to convert the emotion of the input utterance to the given target emotion. As opposed to performing emotion conversion on categorical representations, we use a continuous arousal dimension to represent emotions while also achieving…
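
The train-by-reconstruction, convert-by-swapping idea described in the abstract can be illustrated with a toy sketch. This is not the EmoConv-Diff model and contains no diffusion process: the functions `split_utterance`, `synthesize`, and `convert` are hypothetical, and treating arousal as the mean level of a feature matrix is an assumption made purely for illustration. It only shows how conditioning on the source emotion reproduces the input, while conditioning on a target value changes the emotion-related part and leaves the rest untouched.

```python
import numpy as np

def split_utterance(features):
    """Toy 'encoder' (hypothetical): separate a per-utterance level, used here
    as a stand-in for arousal, from the residual, used here as a stand-in for
    lexical content and speaker identity."""
    level = float(features.mean())
    residual = features - level
    return residual, level

def synthesize(residual, arousal):
    """Toy 'decoder' (hypothetical): recombine the residual with an arousal
    conditioning value."""
    return residual + arousal

def convert(features, target_arousal=None):
    """With no target, condition on the utterance's own arousal
    (reconstruction, as in training). With a target, swap it in
    (conversion, as at inference)."""
    residual, source_arousal = split_utterance(features)
    arousal = source_arousal if target_arousal is None else target_arousal
    return synthesize(residual, arousal)

# Hypothetical input: a random matrix standing in for (mel bins, frames).
rng = np.random.default_rng(0)
utterance = rng.normal(loc=0.2, scale=1.0, size=(80, 50))

reconstructed = convert(utterance)                  # training-style pass
converted = convert(utterance, target_arousal=0.9)  # inference-style pass
```

In this sketch `reconstructed` matches `utterance` up to floating-point error, and `converted` differs from it only by a constant shift. The real model learns these components rather than computing them in closed form.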

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders

Methods: Focus · Diffusion