ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

Xinfa Zhu; Lei He; Yujia Xiao; Xi Wang; Xu Tan; Sheng Zhao; Lei Xie

arXiv:2501.04416·eess.AS·January 9, 2025

TL;DR

ZSVC is a zero-shot style voice conversion method that combines a disentangled latent diffusion model, speech prompting, and adversarial training to transfer a speaking style from a style prompt while keeping the source speaker's identity.

Contribution

The paper proposes a zero-shot style voice conversion framework that pairs a speech codec with a latent diffusion model, uses speech prompting for in-context learning of speaking style, and adds an adversarial training strategy to improve style transfer.

Findings

01 Outperforms existing methods in zero-shot style transfer quality

02 Effectively disentangles speaking style and speaker identity

03 Demonstrates robustness across diverse speaking styles on large-scale data

Abstract

Style voice conversion aims to transform the speaking style of source speech into a desired style while keeping the original speaker's identity. However, previous style voice conversion approaches primarily focus on well-defined domains such as emotion, which limits their practical applications. In this study, we present ZSVC, a novel Zero-shot Style Voice Conversion approach that utilizes a speech codec and a latent diffusion model with a speech prompting mechanism to facilitate in-context learning for speaking style conversion. To disentangle speaking style and speaker timbre, we introduce an information bottleneck to filter speaking style in the source speech and employ Uncertainty Modeling Adaptive Instance Normalization (UMAdaIN) to perturb the speaker timbre in the style prompt. Moreover, we propose a novel adversarial training strategy to enhance in-context learning and improve…
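
The abstract names UMAdaIN but this page does not give its formulation. As a rough illustration of the general idea only, the sketch below shows plain adaptive instance normalization (normalize each channel of one feature map, then rescale it with another's per-channel mean and standard deviation) followed by a random perturbation of those statistics. The Gaussian noise model, the `noise_scale` parameter, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def channel_stats(x, eps=1e-5):
    """Return (mean, std) over the time axis for one channel."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return mean, math.sqrt(var + eps)


def adain(content, target_stats):
    """Plain AdaIN: normalize each content channel, then rescale it
    with the given per-channel (mean, std) pairs."""
    out = []
    for x, (t_mean, t_std) in zip(content, target_stats):
        c_mean, c_std = channel_stats(x)
        out.append([t_std * (v - c_mean) / c_std + t_mean for v in x])
    return out


def perturb_stats(stats, noise_scale, rng):
    """Hypothetical perturbation (an assumption, not the paper's rule):
    jitter each (mean, std) pair with Gaussian noise."""
    perturbed = []
    for mean, std in stats:
        new_mean = mean + noise_scale * rng.gauss(0.0, 1.0)
        new_std = abs(std + noise_scale * rng.gauss(0.0, 1.0))
        perturbed.append((new_mean, new_std))
    return perturbed


# Toy features: 2 channels x 4 time steps (made-up values).
prompt = [[0.0, 1.0, 2.0, 3.0], [2.0, 2.5, 3.0, 3.5]]
reference = [[10.0, 12.0, 14.0, 16.0], [-1.0, 0.0, 1.0, 2.0]]

rng = random.Random(0)
ref_stats = [channel_stats(ch) for ch in reference]
noisy_stats = perturb_stats(ref_stats, noise_scale=0.1, rng=rng)
perturbed_prompt = adain(prompt, noisy_stats)
```

Because mean and standard deviation per channel are the only quantities AdaIN transfers, randomizing them changes global feature statistics while leaving the normalized temporal pattern of each channel untouched, which is one way to read "perturbing the speaker timbre in the style prompt".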

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Adaptive Instance Normalization · Diffusion · Latent Diffusion Model · Focus · Instance Normalization