Enhancing In-the-Wild Speech Emotion Conversion with Resynthesis-based Duration Modeling

Navin Raj Prabhu; Danilo de Oliveira; Nale Lehmann-Willenbrock; Timo Gerkmann

arXiv:2508.11535·eess.AS·August 18, 2025

TL;DR

This paper introduces a novel duration modeling framework using resynthesis-based discrete content representations to improve emotion conversion in speech, allowing controllable speech rates and enhanced expressiveness without parallel data.

Contribution

The paper presents a duration modeling approach that modifies speech duration to reflect the target emotion on in-the-wild data, extending emotion conversion beyond local acoustic properties such as fundamental frequency, spectral envelope and energy.

Findings

01. Longer durations correlate with low-arousal emotions.

02. Shorter durations are associated with high-arousal emotions.

03. The proposed method significantly improves emotional expressiveness on the in-the-wild MSP-Podcast dataset.

Abstract

Speech Emotion Conversion aims to modify the emotion expressed in input speech while preserving lexical content and speaker identity. Recently, generative modeling approaches have shown promising results in changing local acoustic properties such as fundamental frequency, spectral envelope and energy, but often lack the ability to control the duration of sounds. To address this, we propose a duration modeling framework using resynthesis-based discrete content representations, enabling modification of speech duration to reflect target emotions and achieve controllable speech rates without using parallel data. Experimental results reveal that including the proposed duration modeling framework significantly enhances emotional expressiveness on the in-the-wild MSP-Podcast dataset. Analyses show that low-arousal emotions correlate with longer durations and slower speech rates, while…
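
To make the idea of changing duration on a discrete content representation concrete, here is a minimal sketch. It is not the authors' method: the abstract describes a duration modeling framework, while this sketch applies a single hand-set rate factor. It assumes speech has already been encoded as one integer unit per frame (a hypothetical input), collapses repeated units into (unit, duration) pairs, rescales the durations, and expands the result back to a frame-level sequence that a resynthesis vocoder could consume. The function names and the `rate` parameter are illustrative assumptions.

```python
from itertools import groupby
from typing import List, Tuple


def run_length_encode(units: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (unit, duration_in_frames) pairs."""
    return [(unit, len(list(group))) for unit, group in groupby(units)]


def scale_durations(pairs: List[Tuple[int, int]], rate: float) -> List[Tuple[int, int]]:
    """Divide every duration by `rate` (hypothetical speech-rate factor).

    rate > 1 shortens units (faster speech); rate < 1 lengthens them.
    Each unit keeps at least one frame so no content is dropped.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [(unit, max(1, round(duration / rate))) for unit, duration in pairs]


def expand(pairs: List[Tuple[int, int]]) -> List[int]:
    """Repeat each unit for its duration to get a frame-level sequence again."""
    return [unit for unit, duration in pairs for _ in range(duration)]


# Illustrative frame-level unit sequence (made-up values, not real data).
units = [5, 5, 5, 5, 9, 9, 2, 2, 2, 2, 2, 2]
pairs = run_length_encode(units)               # [(5, 4), (9, 2), (2, 6)]

faster = expand(scale_durations(pairs, 2.0))   # [5, 5, 9, 2, 2, 2]
slower = expand(scale_durations(pairs, 0.5))   # 24 frames, same unit order
```

The unit order is unchanged in both outputs, so the lexical content carried by the units is preserved while the overall length changes. A learned model would instead predict a separate duration for each unit conditioned on the target emotion.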
