EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion

Ashishkumar Gudmalwar; Ishan D. Biyani; Nirmesh Shah; Pankaj Wasnik; Rajiv Ratn Shah

arXiv:2412.20359·eess.AS·December 31, 2024

TL;DR

This paper introduces EmoReg, a novel diffusion-based emotional voice conversion method that uses self-supervised features and directional latent vectors to precisely control emotional intensity, improving speech quality and emotional accuracy.

Contribution

It proposes the first emotion intensity regularization technique in diffusion-based voice conversion using unsupervised latent vector modeling and self-supervised features.

Findings

01 Outperforms state-of-the-art baselines in subjective and objective evaluations.

02 Effective in English and Hindi languages.

03 Generates high-quality speech with controlled emotional intensity.

Abstract

Emotional Voice Conversion (EVC) aims to convert the discrete emotional state of a given speech utterance from the source emotion to the target emotion while preserving its linguistic content. In this paper, we propose regularizing emotion intensity in a diffusion-based EVC framework to generate precise speech of the target emotion. Traditional approaches control the intensity of an emotional state in the utterance via emotion class probabilities or intensity labels, which often leads to inept style manipulation and degraded quality. In contrast, we aim to regulate emotion intensity using self-supervised learning-based feature representations and unsupervised directional latent vector modeling (DVM) in the emotional embedding space within a diffusion-based framework. These emotion embeddings can be modified based on the given target emotion intensity and the corresponding direction…
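
The idea of shifting an emotion embedding along a direction, scaled by a target intensity, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the direction is taken as the unit vector from the centroid of neutral embeddings to the centroid of target-emotion embeddings, the update is a plain linear shift, the embeddings are random toy data, and the names direction_vector and regulate_intensity are hypothetical. The abstract is truncated before it specifies how the paper actually computes directions or applies them.

```python
import numpy as np


def direction_vector(neutral_embs: np.ndarray, emotion_embs: np.ndarray) -> np.ndarray:
    """Unit vector from the neutral centroid to the target-emotion centroid.

    ASSUMPTION: this centroid-difference definition is illustrative only;
    the source does not state how the directional vector is obtained.
    """
    diff = emotion_embs.mean(axis=0) - neutral_embs.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("centroids coincide; direction is undefined")
    return diff / norm


def regulate_intensity(emb: np.ndarray, direction: np.ndarray, intensity: float) -> np.ndarray:
    """Shift an embedding along `direction` by `intensity` (assumed linear update)."""
    return emb + intensity * direction


# Toy 8-dimensional embeddings (synthetic, not real speech features).
rng = np.random.default_rng(0)
offset = np.array([1.0] + [0.0] * 7)
neutral = rng.normal(0.0, 0.1, size=(50, 8))
happy = rng.normal(0.0, 0.1, size=(50, 8)) + offset

d = direction_vector(neutral, happy)
src = neutral[0]
weak = regulate_intensity(src, d, 0.3)    # mild shift toward the target emotion
strong = regulate_intensity(src, d, 0.9)  # stronger shift along the same direction
```

In this sketch the intensity is a single scalar, so a larger value moves the embedding further along the same direction while an intensity of zero leaves it unchanged. In a full system the modified embedding would condition the diffusion-based decoder, which is outside the scope of this example.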

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing

Methods: Diffusion