Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

Tianhua Qi; Shiyan Wang; Cheng Lu; Yan Zhao; Yuan Zong; Wenming Zheng

arXiv:2407.14800 · eess.AS · July 23, 2024

Open Access

TL;DR

This paper introduces EINet, a novel emotional voice conversion model that dynamically controls emotional intensity to produce more natural and diverse emotional speech, utilizing emotion evaluation and intensity mapping for precise modulation.

Contribution

The paper presents EINet, a new emotional voice conversion framework that incorporates controllable emotional intensity and a comprehensive emotion evaluation mechanism for improved speech naturalness.

Findings

01 EINet outperforms existing methods in naturalness and emotional diversity.

02 Using an emotion evaluator and an intensity mapper improves how accurately emotional nuances are captured.

03 Adaptive duration prediction enhances rhythm and prosody control.

Abstract

Realistic emotional voice conversion (EVC) aims to enhance the emotional diversity of converted audio, making the synthesized voices more authentic and natural. To this end, we propose the Emotional Intensity-aware Network (EINet), which dynamically adjusts intonation and rhythm by incorporating controllable emotional intensity. To better capture nuances in emotional intensity, we go beyond mere distance measurements among acoustic features. Instead, an emotion evaluator is utilized to precisely quantify the speaker's emotional state. By employing an intensity mapper, intensity pseudo-labels are obtained to bridge the gap between emotional speech intensity modeling and run-time conversion. To ensure high speech quality while retaining controllability, an emotion renderer is used to combine linguistic features smoothly with manipulated emotional features at the frame level. Furthermore, we employ a…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis