Zero-Shot Voice Conversion via Content-Aware Timbre Ensemble and Conditional Flow Matching
Yu Pan, Yuguang Yang, Jixun Yao, Lei Ma, Jianjun Zhao

TL;DR
This paper introduces CTEFM-VC, a novel zero-shot voice conversion framework that combines content-aware timbre ensemble modeling with conditional flow matching to improve speaker similarity and naturalness.
Contribution
It presents a new zero-shot VC method that decouples content and timbre, uses a context-aware timbre ensemble with cross-attention, and employs a structural similarity timbre loss for end-to-end training.
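As a rough illustration of how a cross-attention module could let source content select among several timbre embeddings, the sketch below uses content frames as queries and a small set of speaker embeddings as keys and values. It is not the authors' architecture: the function name `timbre_cross_attention`, all shapes, the random projection matrices, and the assumption that the speaker verification embeddings have already been projected to a common size are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def timbre_cross_attention(content, spk_embs, Wq, Wk, Wv):
    """Hypothetical single-head cross-attention.

    content:  (T, d_c) content features, one row per frame (queries)
    spk_embs: (K, d_s) K speaker embeddings, assumed already projected
              to a shared dimension d_s (keys and values)
    """
    q = content @ Wq                          # (T, d)
    k = spk_embs @ Wk                         # (K, d)
    v = spk_embs @ Wv                         # (K, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, K) scaled dot products
    weights = softmax(scores, axis=-1)        # per-frame weights over the K embeddings
    return weights @ v, weights               # fused timbre per frame, attention map

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
T, K, d_c, d_s, d = 5, 3, 8, 6, 4
content = rng.standard_normal((T, d_c))
spk_embs = rng.standard_normal((K, d_s))
Wq = rng.standard_normal((d_c, d))
Wk = rng.standard_normal((d_s, d))
Wv = rng.standard_normal((d_s, d))

fused, weights = timbre_cross_attention(content, spk_embs, Wq, Wk, Wv)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to 1, so every content frame receives its own convex mixture of the K timbre embeddings, which is one plausible reading of "adaptively integrates".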
Findings
Achieves superior speaker similarity and naturalness over state-of-the-art methods.
Effectively utilizes diverse speaker embeddings for timbre modeling.
Significantly outperforms existing zero-shot VC systems in all evaluated metrics.
Abstract
Despite recent advances in zero-shot voice conversion (VC), achieving speaker similarity and naturalness comparable to ground-truth recordings remains a significant challenge. In this letter, we propose CTEFM-VC, a zero-shot VC framework that integrates content-aware timbre ensemble modeling with conditional flow matching. Specifically, CTEFM-VC decouples utterances into content and timbre representations and leverages a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. To enhance its timbre modeling capability and naturalness of generated speech, we first introduce a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the effective utilization of source content and target timbre elements through a cross-attention module. Furthermore, a structural similarity-based timbre…
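To make the conditional flow matching objective more concrete, here is a minimal sketch of the widely used optimal-transport formulation, in which a noise sample is moved to a data sample along a straight line and a network regresses the constant velocity of that path. The text above does not state which probability path CTEFM-VC uses, so this formulation is an assumption, as are the names `cfm_training_pair`, `cfm_loss`, `toy_velocity`, the value of `sigma_min`, and the Mel-spectrogram shape. The velocity function is a placeholder, not a trained model.

```python
import numpy as np

def cfm_training_pair(x1, x0, t, sigma_min=1e-4):
    """Interpolated sample x_t and target velocity u_t for the assumed OT path.

    x1: (B, n_mels, frames) target Mel-spectrograms
    x0: (B, n_mels, frames) Gaussian noise
    t:  (B,) time values in [0, 1]
    """
    t = t.reshape(-1, 1, 1)
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t

def cfm_loss(velocity_fn, x1, cond, rng, sigma_min=1e-4):
    # Sample noise and time, then regress the predicted velocity onto u_t.
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(size=x1.shape[0])
    x_t, u_t = cfm_training_pair(x1, x0, t, sigma_min)
    v = velocity_fn(x_t, t, cond)
    return float(np.mean((v - u_t) ** 2))

def toy_velocity(x_t, t, cond):
    # Placeholder standing in for a neural network conditioned on
    # content and timbre features (hypothetical).
    return cond - x_t

rng = np.random.default_rng(0)
mel = rng.standard_normal((2, 80, 16))   # assumed batch of 80-bin Mel-spectrograms
cond = np.zeros_like(mel)                # stand-in for content/timbre conditioning
loss = cfm_loss(toy_velocity, mel, cond, rng)
print(loss)
```

At inference time, a model trained this way would generate a Mel-spectrogram by integrating the learned velocity field from noise at t = 0 to t = 1 with an ODE solver.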
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
