Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion

Kaidi Wang; Wenhao Guan; Ziyue Jiang; Hukai Huang; Peijie Chen; Weijie Wu; Qingyang Hong; Lin Li

arXiv:2505.24291·cs.SD·June 2, 2025

Open Access

TL;DR

Discl-VC introduces a novel zero-shot voice conversion framework that disentangles content and prosody, enabling precise control over speech style and prosody through in-context learning and discrete token prediction.

Contribution

The paper presents Discl-VC, a method that improves controllability and accuracy in zero-shot voice conversion by disentangling content and prosody from self-supervised speech representations and using a mask generative transformer to predict discrete prosody tokens.

Findings

01

Outperforms existing methods in zero-shot voice conversion accuracy.

02

Achieves precise prosody control in synthesized speech.

03

Demonstrates superior performance in experimental evaluations.

Abstract

Currently, zero-shot voice conversion systems are capable of synthesizing the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or mimic the distinctive speaking style of the target speaker, thereby limiting the controllability of voice conversion. In this work, we propose Discl-VC, a novel voice conversion framework that disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker's voice through in-context learning with a flow matching transformer. To enable precise control over the prosody of generated speech, we introduce a mask generative transformer that predicts discrete prosody tokens in a non-autoregressive manner based on prompts. Experimental results demonstrate the superior performance of Discl-VC in zero-shot voice conversion and…
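The non-autoregressive prediction of discrete prosody tokens mentioned in the abstract can be illustrated with a generic iterative masked-decoding loop. This is only a rough sketch under stated assumptions, not the authors' implementation: the abstract does not specify the decoding schedule, vocabulary size, or how prompts condition the model. The cosine unmasking schedule, the `MASK` id, and the random `toy_predictor` (a stand-in for a trained transformer) are all hypothetical.

```python
import math
import random

MASK = -1  # hypothetical id marking a position that has not been predicted yet


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model (assumption): returns a (token, confidence)
    pair for every position. A real model would condition on the prompt and on
    the tokens already committed; this one is random."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_masked_decode(length, vocab_size, steps=4, seed=0):
    """Fill a fully masked sequence in a few parallel refinement steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predictor(tokens, vocab_size, rng)
        # Cosine schedule (assumed): number of positions left masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_masked_decode(length=16, vocab_size=32, steps=4))
```

All positions are predicted in parallel at each step and only the most confident ones are kept, so the sequence is completed in a fixed number of passes rather than one token at a time.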

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques