Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion
Zhengyang Chen, Shuai Wang, Mingyang Zhang, Xuechen Liu, Junichi Yamagishi, Yanmin Qian

TL;DR
This paper introduces a zero-shot voice conversion system leveraging in-context learning and semantic tokens, enhanced with prosody embeddings to better preserve speech prosody, demonstrating improved speaker similarity and prosody retention.
Contribution
The paper proposes a novel ICL-based voice conversion system with a mask and reconstruction training strategy, integrating prosody embeddings to better preserve source speech prosody.
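The mask and reconstruction idea can be illustrated with a small sketch. This summary does not give the paper's exact formulation, so the code below assumes a common conditional flow-matching setup (a straight-line path from noise to data, with the loss restricted to masked frames). The function names, `sigma_min`, and the array shapes are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np


def masked_fm_example(mel, mask, t, sigma_min=1e-4, seed=0):
    """Build one flow-matching training example for masked reconstruction.

    mel:  (T, D) array of target acoustic frames (assumed, e.g. mel-spectrogram)
    mask: (T,) boolean array, True where frames are hidden from the model
    t:    scalar in [0, 1], the flow-matching time step
    """
    rng = np.random.default_rng(seed)
    x1 = mel                                  # data sample
    x0 = rng.normal(size=mel.shape)           # noise sample
    # Point on the straight-line path between noise and data.
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Velocity the model is trained to predict along that path.
    target = x1 - (1.0 - sigma_min) * x0
    # Context shown to the model: unmasked frames only, masked frames zeroed.
    context = np.where(mask[:, None], 0.0, mel)
    return xt, target, context


def masked_mse(pred, target, mask):
    """Mean squared error computed over masked frames only."""
    per_frame = ((pred - target) ** 2).mean(axis=1)
    return per_frame[mask].mean()


# Toy data standing in for real acoustic features.
mel = np.random.default_rng(1).normal(size=(50, 80))
mask = np.zeros(50, dtype=bool)
mask[10:30] = True                            # hide a contiguous span
xt, target, context = masked_fm_example(mel, mask, t=0.5)
```

In such a setup, a network would receive `xt`, `context`, and any conditioning (for example semantic tokens) and be trained with `masked_mse` against `target`. At inference, the unmasked context would come from the target speaker's prompt, which is what gives the model its in-context behavior.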
Findings
ICL-VC improves speaker similarity in voice conversion.
K-means is effective for tokenization across pre-trained models.
Incorporating prosody embeddings enhances prosody preservation.
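As a rough illustration of k-means tokenization, the sketch below clusters frame-level feature vectors and maps each frame to the index of its nearest centroid. The random features stand in for outputs of a pre-trained speech model; the feature dimension, codebook size, and function names are assumptions for the example and are not taken from the paper.

```python
import numpy as np


def assign_tokens(features, centroids):
    """Map each (T, D) frame to the index of its nearest centroid."""
    # Squared Euclidean distance between every frame and every centroid: (T, K).
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a (k, D) codebook of centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct frames.
    init_idx = rng.choice(len(features), size=k, replace=False)
    centroids = features[init_idx].copy()
    for _ in range(iters):
        tokens = assign_tokens(features, centroids)
        for j in range(k):
            members = features[tokens == j]
            if len(members) > 0:              # keep old centroid if cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


# Stand-in for frame-level features from a pre-trained model (hypothetical shapes).
frames = np.random.default_rng(0).normal(size=(200, 16))
centroids = fit_kmeans(frames, k=8)
tokens = assign_tokens(frames, centroids)     # one discrete token per frame
```

Because the procedure only needs a matrix of frame features, the same code applies regardless of which pre-trained model produced them, which is consistent with the finding that k-means works across models.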
Abstract
Voice conversion (VC) aims to modify the speaker's timbre while retaining speech content. Previous approaches have tokenized the outputs from self-supervised models into semantic tokens, facilitating disentanglement of speech content information. Recently, in-context learning (ICL) has emerged in text-to-speech (TTS) systems for effectively modeling specific characteristics such as timbre through context conditioning. This paper proposes an ICL capability enhanced VC system (ICL-VC) employing a mask and reconstruction training strategy based on flow-matching generative models. Augmented with semantic tokens, our experiments on the LibriTTS dataset demonstrate that ICL-VC improves speaker similarity. Additionally, we find that k-means is a versatile tokenization method applicable to various pre-trained models. However, the ICL-VC system faces challenges in preserving the prosody of the source…
Taxonomy
Topics: Speech Recognition and Synthesis
