Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

Linhan Ma; Xinfa Zhu; Yuanjun Lv; Zhichao Wang; Ziqian Wang; Wendi He; Hongbin Zhou; Lei Xie

arXiv:2406.09844·cs.SD·June 17, 2024

Open Access

TL;DR

Vec-Tok-VC+ introduces a residual-enhanced, prompt-based zero-shot voice conversion model that leverages a dual-mode training strategy and progressive constraints to improve naturalness, content preservation, and speaker similarity.

Contribution

The paper presents Vec-Tok-VC+, a novel zero-shot VC model with residual-enhanced decoupling, dual-mode training, and multi-codebook progressive loss for improved performance.

Findings

01. Outperforms baselines in naturalness and speaker similarity

02. Achieves effective zero-shot conversion with only 3s target prompts

03. Enhances semantic content extraction and reduces training-inference mismatch

Abstract

Zero-shot voice conversion (VC) aims to transform source speech into an arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process and training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model improved from Vec-Tok Codec, achieving voice conversion given only a 3s target speaker prompt. We design a residual-enhanced K-Means decoupler to enhance semantic content extraction with a two-layer clustering process. In addition, we employ teacher-guided refinement to simulate the conversion process and eliminate the training-inference mismatch, forming a dual-mode training strategy. Furthermore, we design a multi-codebook progressive loss function to constrain the layer-wise output of the model…
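
The abstract describes the decoupler only as a two-layer K-Means clustering process with a residual enhancement. One plausible reading is generic residual quantization: cluster the frame features, subtract the assigned centroids, then cluster what is left. The sketch below illustrates that general idea on random data. It is not the authors' implementation; the feature dimension, the codebook sizes, and the function names `kmeans` and `residual_kmeans` are all assumptions made for illustration.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(0)
    # Final assignment against the last centroid update.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)

def residual_kmeans(x, k1, k2):
    """Two-layer residual clustering (hypothetical helper, not from the paper)."""
    c1, a1 = kmeans(x, k1, seed=0)
    q1 = c1[a1]                # layer-1 approximation of each frame
    residual = x - q1          # what layer 1 failed to capture
    c2, a2 = kmeans(residual, k2, seed=1)
    q2 = q1 + c2[a2]           # layer-1 plus layer-2 approximation
    return q1, q2

# Stand-in for frame-level speech features; sizes are assumed.
rng = np.random.default_rng(42)
frames = rng.normal(size=(500, 16))

q1, q2 = residual_kmeans(frames, k1=8, k2=8)
err1 = np.mean((frames - q1) ** 2)
err2 = np.mean((frames - q2) ** 2)
print(f"layer-1 error: {err1:.4f}, layer-1+2 error: {err2:.4f}")
```

In this sketch the second layer can only reduce the reconstruction error relative to the first, which is the sense in which a residual stage recovers detail that a single clustering pass discards. How the paper uses the two layers inside the model is not specified in the abstract.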

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems