Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching

Yu Pan; Yuguang Yang; Jixun Yao; Lei Ma; Jianjun Zhao

arXiv:2411.02026 · cs.SD · August 12, 2025

Open Access

TL;DR

This paper introduces CTEFM-VC, a novel zero-shot voice conversion framework that combines content-aware timbre ensemble modeling with conditional flow matching to improve speaker similarity and naturalness.

Contribution

It presents a new zero-shot VC method that decouples content and timbre, uses a context-aware timbre ensemble with cross-attention, and employs a structural similarity timbre loss for end-to-end training.
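As a rough illustration of how a cross-attention module could let content frames select among several speaker embeddings, here is a minimal sketch in plain Python. It is not the paper's architecture: the function names `softmax` and `cross_attend`, the absence of learned projections, and the toy dimensions are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(content_frames, speaker_embeddings):
    """Fuse each content frame with a weighted mix of speaker embeddings.

    content_frames: list of T query vectors, each of length d.
    speaker_embeddings: list of K key/value vectors, each of length d,
        e.g. one per speaker verification model (hypothetical setup).
    Returns (fused, weights): T mixed vectors and T weight rows of length K.
    """
    d = len(content_frames[0])
    fused, weights = [], []
    for q in content_frames:
        # Scaled dot-product score between this frame and every embedding.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in speaker_embeddings]
        w = softmax(scores)
        # Attention output: convex combination of the speaker embeddings.
        mix = [sum(w[j] * speaker_embeddings[j][i] for j in range(len(w)))
               for i in range(d)]
        fused.append(mix)
        weights.append(w)
    return fused, weights

# Toy usage: 2 content frames, 3 speaker embeddings, dimension 2.
content = [[1.0, 0.0], [0.0, 1.0]]
speakers = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, weights = cross_attend(content, speakers)
```

Each content frame ends up with its own mixture of the speaker embeddings, which is one simple way to read "adaptive" integration; a trained model would add learned query, key, and value projections on top of this.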

Findings

01 Achieves superior speaker similarity and naturalness over state-of-the-art methods.

02 Effectively utilizes diverse speaker embeddings for timbre modeling.

03 Significantly outperforms existing zero-shot VC systems in all evaluated metrics.

Abstract

Despite recent advances in zero-shot voice conversion (VC), achieving speaker similarity and naturalness comparable to ground-truth recordings remains a significant challenge. In this letter, we propose CTEFM-VC, a zero-shot VC framework that integrates content-aware timbre ensemble modeling with conditional flow matching. Specifically, CTEFM-VC decouples utterances into content and timbre representations and leverages a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. To enhance its timbre modeling capability and naturalness of generated speech, we first introduce a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the effective utilization of source content and target timbre elements through a cross-attention module. Furthermore, a structural similarity-based timbre…
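To make the conditional flow matching idea more concrete, the sketch below builds one training example using the commonly used straight-line (optimal-transport) probability path. This page does not state which path, noise constant, or conditioning scheme the paper uses, so `SIGMA_MIN`, `cfm_pair`, `cfm_loss`, and the `velocity_model` callable are all assumptions for illustration only.

```python
import random

SIGMA_MIN = 1e-4  # assumed small constant; not taken from the paper

def cfm_pair(x1, t, rng):
    """Return (x_t, u_t) for a target vector x1 at time t in [0, 1].

    x0 is Gaussian noise; x_t interpolates from x0 toward x1, and u_t is
    the constant velocity the model should predict along that path.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    scale = 1.0 - (1.0 - SIGMA_MIN) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - SIGMA_MIN) * a for a, b in zip(x0, x1)]
    return x_t, u_t

def cfm_loss(velocity_model, x1, cond, rng):
    """Mean squared error between predicted and target velocity.

    velocity_model(x_t, t, cond) is a hypothetical callable that returns
    a list the same length as x_t; cond stands in for content and timbre
    conditioning.
    """
    t = rng.random()
    x_t, u_t = cfm_pair(x1, t, rng)
    v = velocity_model(x_t, t, cond)
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u_t)) / len(x1)

# Toy usage: a "model" that always predicts zero velocity.
rng = random.Random(0)
mel_frame = [0.5, -1.0, 2.0]  # stand-in for one Mel-spectrogram frame
loss = cfm_loss(lambda x_t, t, cond: [0.0] * len(x_t), mel_frame, None, rng)
```

At inference, a trained velocity model would be integrated from noise at t = 0 to t = 1 with an ODE solver to produce the Mel-spectrogram; that step is omitted here.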

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques