VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching
Ha-Yeong Choi, Jaehan Park

TL;DR
VoicePrompter is a zero-shot voice conversion system that uses voice prompts, feature disentanglement, and conditional flow matching to significantly improve speaker similarity and naturalness for unseen speakers.
Contribution
The paper presents a new zero-shot VC model that combines voice prompts, a factorization method, and a DiT-based conditional flow matching decoder, along with latent mixup for enhanced in-context learning.
Findings
Outperforms existing zero-shot VC systems in speaker similarity.
Improves speech naturalness and intelligibility.
Demonstrates robustness in zero-shot scenarios.
Abstract
Despite remarkable advancements in recent voice conversion (VC) systems, enhancing speaker similarity in zero-shot scenarios remains challenging. This challenge arises from the difficulty of generalizing and adapting speaker characteristics in speech within zero-shot environments, which is further complicated by mismatch between the training and inference processes. To address these challenges, we propose VoicePrompter, a robust zero-shot VC model that leverages in-context learning with voice prompts. VoicePrompter is composed of (1) a factorization method that disentangles speech components and (2) a DiT-based conditional flow matching (CFM) decoder that conditions on these factorized features and voice prompts. Additionally, (3) latent mixup is used to enhance in-context learning by combining various speaker features. This approach improves speaker similarity and naturalness in…
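The abstract names conditional flow matching and latent mixup but gives no formulas for either. As a rough illustration only, the sketch below uses the widely used straight-line (optimal-transport) probability path for CFM and a Beta-weighted convex combination for mixup. The function names, the `sigma_min` and `alpha` parameters, the choice of path, and the tensor shapes are all assumptions made for this example and are not taken from the paper. The DiT velocity network itself is not shown.

```python
import numpy as np

def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Sample (t, x_t, u_t) for a batch of target features x1.

    Assumes the straight-line path
        x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1,
    whose target velocity is u_t = x1 - (1 - sigma_min) * x0.
    """
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise sample
    # One time step per batch item, broadcast over the remaining axes.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return t, x_t, u_t

def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))

def latent_mixup(z_a, z_b, rng, alpha=1.0):
    """Convex combination of two latent features (hypothetical placement).

    lam is drawn from Beta(alpha, alpha), as in standard mixup.
    """
    lam = rng.beta(alpha, alpha)
    return lam * z_a + (1.0 - lam) * z_b, lam
```

In training, a decoder would receive `x_t`, `t`, and its conditioning inputs (here, the factorized features and the voice prompt), and `cfm_loss` would be applied to its output. Where exactly mixup is applied in VoicePrompter is not stated in the abstract, so `latent_mixup` is shown on generic latent arrays.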
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Mixup
