Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion

Zhengyang Chen; Shuai Wang; Mingyang Zhang; Xuechen Liu; Junichi Yamagishi; Yanmin Qian

arXiv:2409.05004·cs.SD·September 11, 2024

TL;DR

This paper introduces a zero-shot voice conversion system leveraging in-context learning and semantic tokens, enhanced with prosody embeddings to better preserve speech prosody, demonstrating improved speaker similarity and prosody retention.

Contribution

The paper proposes a novel ICL-based voice conversion system with a mask and reconstruction training strategy, integrating prosody embeddings to better preserve source speech prosody.

Findings

01 ICL-VC improves speaker similarity in voice conversion.

02 K-means is effective for tokenization across pre-trained models.

03 Incorporating prosody embeddings enhances prosody preservation.

Abstract

Voice conversion (VC) aims to modify the speaker's timbre while retaining speech content. Previous approaches have tokenized the outputs of self-supervised models into semantic tokens, facilitating the disentanglement of speech content information. Recently, in-context learning (ICL) has emerged in text-to-speech (TTS) systems for effectively modeling specific characteristics such as timbre through context conditioning. This paper proposes an ICL-capability-enhanced VC system (ICL-VC) that employs a mask-and-reconstruction training strategy based on flow-matching generative models. Augmented with semantic tokens, our experiments on the LibriTTS dataset demonstrate that ICL-VC improves speaker similarity. Additionally, we find that k-means is a versatile tokenization method applicable to various pre-trained models. However, the ICL-VC system faces challenges in preserving the prosody of the source…
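To make the idea of k-means tokenization concrete, the following is a minimal sketch of how frame-level features from a pre-trained model could be mapped to discrete semantic tokens. It is an illustration only, not the paper's implementation: the random `frames` array stands in for real self-supervised features, and the function names `fit_kmeans` and `tokenize`, the feature dimension, and the codebook size of 8 are all assumptions chosen for brevity.

```python
import numpy as np


def tokenize(features, centroids):
    """Assign each frame to the index of its nearest centroid."""
    # Squared Euclidean distances, shape (num_frames, num_centroids).
    diffs = features[:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns a (k, dim) codebook."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct, randomly chosen frames.
    init_idx = rng.choice(len(features), size=k, replace=False)
    centroids = features[init_idx].copy()
    for _ in range(n_iter):
        tokens = tokenize(features, centroids)
        for j in range(k):
            members = features[tokens == j]
            # Leave a centroid unchanged if no frame is assigned to it.
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


# Hypothetical stand-in for frame-level features from a pre-trained model:
# 200 frames, 16 dimensions (real features are much larger).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))

codebook = fit_kmeans(frames, k=8)
semantic_tokens = tokenize(frames, codebook)  # one integer token per frame
```

Because the tokenizer only needs a matrix of frame features, the same two functions apply unchanged to features from any pre-trained model, which is one way to read the abstract's remark that k-means is a versatile tokenization method.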

Taxonomy

Topics: Speech Recognition and Synthesis