PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts

Tianhua Qi; Shiyan Wang; Cheng Lu; Tengfei Song; Hao Yang; Zhanglin Wu; Wenming Zheng

arXiv:2505.20678 · eess.AS · May 28, 2025

TL;DR

PromptEVC introduces a novel method for controllable emotional voice conversion using natural language prompts, enabling precise emotion manipulation and improved naturalness in synthesized speech.

Contribution

It proposes a new framework that uses natural language prompts and emotion descriptors to achieve flexible and fine-grained emotion control in voice conversion.

Findings

01 Outperforms state-of-the-art methods in emotion conversion accuracy

02 Enables detailed control over emotion intensity and mixed emotions

03 Enhances naturalness through prosody modeling and speaker identity preservation

Abstract

Controllable emotional voice conversion (EVC) aims to manipulate emotional expressions to increase the diversity of synthesized speech. Existing methods typically rely on predefined labels, reference audio, or prespecified factor values, often overlooking individual differences in emotion perception and expression. In this paper, we introduce PromptEVC, which uses natural language prompts for precise and flexible emotion control. To bridge text descriptions with emotional speech, we propose an emotion descriptor and a prompt mapper that generate fine-grained emotion embeddings and are trained jointly with reference embeddings. To enhance naturalness, we present a prosody modeling and control pipeline that adjusts the rhythm based on linguistic content and emotional cues. Additionally, a speaker encoder is incorporated to preserve speaker identity. Experimental results demonstrate that PromptEVC…
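The abstract does not spell out how prompt-derived emotion embeddings are tied to reference embeddings. One plausible reading is that a mapper projects text features into the emotion-embedding space and is pulled toward the embedding of a matching reference utterance. The sketch below illustrates only that idea, in plain Python. The linear mapper, the cosine alignment loss, the dimensions, and every name in it are hypothetical and are not taken from the paper.

```python
import math
import random


def linear_map(weights, vec):
    """Apply a dense projection: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(prompt_emb, reference_emb):
    """1 - cosine similarity: 0 when both embeddings point the same way."""
    return 1.0 - cosine_similarity(prompt_emb, reference_emb)


random.seed(0)
TEXT_DIM, EMO_DIM = 8, 4  # hypothetical sizes, chosen only for illustration

# Stand-in for a text encoder's output for a natural language prompt (assumption).
text_features = [random.gauss(0, 1) for _ in range(TEXT_DIM)]

# Stand-in for the emotion embedding of a reference utterance (assumption).
reference_emb = [random.gauss(0, 1) for _ in range(EMO_DIM)]

# Hypothetical "prompt mapper": a single untrained linear layer.
mapper_weights = [
    [random.gauss(0, 0.1) for _ in range(TEXT_DIM)] for _ in range(EMO_DIM)
]

prompt_emb = linear_map(mapper_weights, text_features)
loss = alignment_loss(prompt_emb, reference_emb)
```

In a real system the mapper would be a trained network and this term would be one of several objectives. Here it only shows the shape of the computation: text features in, an emotion-space vector out, and a scalar that shrinks as the two embeddings align.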

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing