ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

Yu Pan; Yanni Hu; Yuguang Yang; Jixun Yao; Jianhao Ye; Hongbin Zhou; Lei Ma; Jianjun Zhao

arXiv:2505.13805·cs.SD·May 21, 2025

TL;DR

ClapFM-EVC is a novel emotional voice conversion framework that achieves high-fidelity, flexible, and interpretable emotional speech synthesis driven by natural language prompts or reference speech, with adjustable emotion intensity.

Contribution

It introduces EVC-CLAP for cross-modal emotional feature extraction and a flow matching model for high-quality speech reconstruction, advancing emotional voice conversion technology.
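The cross-modal alignment idea behind a contrastive language-audio model can be illustrated with a generic CLAP-style symmetric contrastive loss. This is a minimal sketch, not the EVC-CLAP objective itself: this summary does not specify the loss, and the function names, embedding sizes, and temperature value below are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic CLAP-style loss (assumed form, not taken from the paper).

    audio_emb, text_emb: arrays of shape (batch, dim), where row i of each
    array comes from the same (speech, emotion prompt) pair.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Audio-to-text: each audio row should pick its own text column.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each text column should pick its own audio row.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)

# Toy check with random stand-in embeddings (no real encoders involved).
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
loss_matched = symmetric_contrastive_loss(emb, emb)
loss_mismatched = symmetric_contrastive_loss(emb, emb[::-1])
```

In this toy check the correctly paired batch yields a much lower loss than the batch whose text rows are reversed, which is the behavior a contrastive objective is meant to reward.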

Findings

1. High-quality emotional speech conversion validated by evaluations
2. Flexible control via natural language prompts and reference speech
3. Enhanced emotion expressiveness and naturalness

Abstract

Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model, guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across speech and text modalities. Then, a FuEncoder with an adaptive intensity gate is presented to seamlessly fuse emotional features with Phonetic PosteriorGrams from a pre-trained ASR model. To further improve emotion expressiveness and speech naturalness, we propose a flow matching model conditioned on these captured features to reconstruct…
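The flow matching component mentioned in the abstract can be illustrated with the standard conditional flow matching recipe. The sketch below assumes a linear interpolation path between noise and data, a plain Euler solver, and a hypothetical `velocity_fn(x, t, cond)` standing in for the conditioned network; none of these details are confirmed by the abstract, and the tensor shapes are illustrative only.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Build one training pair for a linear probability path (assumed).

    x0: noise sample, x1: target acoustic features, both (batch, ...).
    t:  per-example times in [0, 1], shape (batch,).
    """
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))  # broadcast t over feature dims
    x_t = (1.0 - t) * x0 + t * x1              # point on the straight path
    target_velocity = x1 - x0                  # constant velocity of that path
    return x_t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    # Mean squared error between predicted and target velocity fields.
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, x0, cond, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps.

    velocity_fn is a hypothetical stand-in for a trained, conditioned model.
    """
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full((x.shape[0],), i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x

# Toy check: an oracle that returns the true velocity transports x0 to x1.
x0 = np.zeros((2, 3, 4))
x1 = np.ones((2, 3, 4))
x_t, u = flow_matching_pair(x0, x1, np.array([0.0, 1.0]))
out = euler_sample(lambda x, t, c: x1 - x0, x0, None, num_steps=8)
```

During training, the network's prediction at `x_t` would be compared against `target_velocity` with `flow_matching_loss`; at inference, `euler_sample` starts from noise and follows the learned velocity field under the conditioning features.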

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: ALIGN