ZSDEVC: Zero-Shot Diffusion-based Emotional Voice Conversion with Disentangled Mechanism

Hsing-Hang Chou; Yun-Shao Lin; Ching-Chin Sung; Yu Tsao; Chi-Chun Lee

arXiv:2409.03636 · eess.AS · September 29, 2025 · Interspeech

Open Access

TL;DR

This paper presents ZSDEVC, a zero-shot diffusion-based emotional voice conversion method that effectively converts emotions in speech for unseen speakers, achieving high emotional accuracy and naturalness.

Contribution

The paper introduces a novel diffusion framework with disentangled mechanisms and expressive guidance for zero-shot emotional voice conversion, trained on a large emotional speech dataset.

Findings

01 High emotional accuracy in converted speech

02 Enhanced naturalness and speech quality

03 Effective zero-shot conversion for unseen speakers

Abstract

The human voice conveys not just words but also emotional states and individuality. Emotional voice conversion (EVC) modifies emotional expressions while preserving linguistic content and speaker identity, improving applications like human-machine interaction. While deep learning has advanced EVC models for specific target speakers on well-crafted emotional datasets, existing methods often face issues with emotion accuracy and speech distortion. In addition, the zero-shot scenario, in which emotion conversion is applied to unseen speakers, remains underexplored. This work introduces a novel diffusion framework with disentangled mechanisms and expressive guidance, trained on a large emotional speech dataset and evaluated on unseen speakers across in-domain and out-of-domain datasets. Experimental results show that our method produces expressive speech with high emotional accuracy,…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques

Methods: Diffusion