Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy
Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie

TL;DR
Vec-Tok-VC+ introduces a residual-enhanced, prompt-based zero-shot voice conversion model that leverages a dual-mode training strategy and progressive constraints to improve naturalness, content preservation, and speaker similarity.
Contribution
The paper presents Vec-Tok-VC+, a novel zero-shot VC model with residual-enhanced decoupling, dual-mode training, and multi-codebook progressive loss for improved performance.
Findings
Outperforms baselines in naturalness and speaker similarity
Achieves effective zero-shot conversion with only 3s target prompts
Enhances semantic content extraction and reduces training-inference mismatch
Abstract
Zero-shot voice conversion (VC) aims to transform source speech into an arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process as well as training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model improved from Vec-Tok Codec, achieving voice conversion given only a 3s target speaker prompt. We design a residual-enhanced K-Means decoupler to enhance the semantic content extraction with a two-layer clustering process. In addition, we employ teacher-guided refinement to simulate the conversion process and eliminate the training-inference mismatch, forming a dual-mode training strategy. Furthermore, we design a multi-codebook progressive loss function to constrain the layer-wise output of the model…
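The two-layer clustering idea mentioned in the abstract can be illustrated with a generic residual K-Means quantizer: a first codebook is fit on the feature vectors, and a second codebook is fit on what the first one fails to capture (the residuals). The sketch below is only an illustration under stated assumptions, not the authors' implementation. The function names (`kmeans`, `fit_residual_kmeans`, `quantize`), the codebook sizes, and the random toy vectors standing in for frame-level speech features are all hypothetical, and how the paper uses the two layers downstream is not described in this summary.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, v):
    """Index of the codebook entry closest to v."""
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], v))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's K-Means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for j, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def fit_residual_kmeans(vectors, k1, k2):
    """Layer 1 clusters the vectors; layer 2 clusters the layer-1 residuals."""
    cb1 = kmeans(vectors, k1)
    residuals = []
    for v in vectors:
        c = cb1[nearest(cb1, v)]
        residuals.append([x - y for x, y in zip(v, c)])
    cb2 = kmeans(residuals, k2, seed=1)
    return cb1, cb2


def quantize(v, cb1, cb2):
    """Approximate v as (nearest layer-1 centroid) + (nearest layer-2 centroid)."""
    c1 = cb1[nearest(cb1, v)]
    r = [x - y for x, y in zip(v, c1)]
    c2 = cb2[nearest(cb2, r)]
    return [a + b for a, b in zip(c1, c2)]


if __name__ == "__main__":
    # Toy data: 60 random 4-dimensional vectors (an assumption for illustration).
    rng = random.Random(42)
    data = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(60)]
    cb1, cb2 = fit_residual_kmeans(data, k1=4, k2=4)
    one_layer = sum(sq_dist(v, cb1[nearest(cb1, v)]) for v in data)
    two_layer = sum(sq_dist(v, quantize(v, cb1, cb2)) for v in data)
    print("one-layer error:", one_layer)
    print("two-layer error:", two_layer)
```

On the training data, the second layer cannot increase the total reconstruction error, because each residual centroid is the mean of its cluster and the mean is never a worse fit than the zero vector. That is the sense in which a residual layer recovers detail that a single clustering pass discards.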
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and Dialogue Systems
