ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
Xinfa Zhu, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie

TL;DR
ZSVC introduces a zero-shot style voice conversion method leveraging disentangled latent diffusion models, speech prompting, and adversarial training to achieve diverse style transfer without prior style data.
Contribution
The paper proposes a novel zero-shot style voice conversion framework combining latent diffusion, speech prompting, and adversarial training for improved style transfer.
Findings
Outperforms existing methods in zero-shot style transfer quality
Effectively disentangles speaking style and speaker identity
Demonstrates robustness across diverse speaking styles on large-scale data
Abstract
Style voice conversion aims to transform the speaking style of source speech into a desired style while keeping the original speaker's identity. However, previous style voice conversion approaches primarily focus on well-defined domains such as emotional aspects, limiting their practical applications. In this study, we present ZSVC, a novel Zero-shot Style Voice Conversion approach that utilizes a speech codec and a latent diffusion model with a speech prompting mechanism to facilitate in-context learning for speaking style conversion. To disentangle speaking style and speaker timbre, we introduce an information bottleneck to filter speaking style in the source speech and employ Uncertainty Modeling Adaptive Instance Normalization (UMAdaIN) to perturb the speaker timbre in the style prompt. Moreover, we propose a novel adversarial training strategy to enhance in-context learning and improve…
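The abstract names UMAdaIN but does not give its formulation here. As a rough illustration only, the sketch below shows standard adaptive instance normalization, plus one plausible way to perturb per-channel statistics with batch-level uncertainty. The function names (`instance_stats`, `adain`, `perturb_timbre_stats`), the `(batch, channels, time)` layout, and the Gaussian resampling of the mean and standard deviation are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def instance_stats(x, eps=1e-5):
    """Per-utterance, per-channel mean and std over the time axis.

    x has the assumed shape (batch, channels, time).
    """
    mean = x.mean(axis=2, keepdims=True)
    std = np.sqrt(x.var(axis=2, keepdims=True) + eps)
    return mean, std


def adain(content, style, eps=1e-5):
    """Standard AdaIN: normalize content, then apply the style statistics."""
    c_mean, c_std = instance_stats(content, eps)
    s_mean, s_std = instance_stats(style, eps)
    return s_std * (content - c_mean) / c_std + s_mean


def perturb_timbre_stats(x, rng, eps=1e-5):
    """Hypothetical uncertainty-style perturbation (an assumption, not the paper's).

    The channel statistics of each utterance are resampled from a Gaussian
    whose spread is the variability of those statistics across the batch,
    so global cues carried by the statistics are randomized while the
    normalized temporal pattern is kept.
    """
    mean, std = instance_stats(x, eps)
    # Uncertainty of each statistic, estimated across the batch dimension.
    mean_unc = np.sqrt(mean.var(axis=0, keepdims=True) + eps)
    std_unc = np.sqrt(std.var(axis=0, keepdims=True) + eps)
    new_mean = mean + rng.standard_normal(mean.shape) * mean_unc
    new_std = std + rng.standard_normal(std.shape) * std_unc
    return new_std * (x - mean) / std + new_mean


rng = np.random.default_rng(0)
content = rng.standard_normal((4, 8, 50))
style = 3.0 * rng.standard_normal((4, 8, 50)) + 2.0

out = adain(content, style)
pert = perturb_timbre_stats(style, rng)
```

In this sketch, `out` carries the per-channel mean and scale of `style`, while `pert` keeps the shape of the style prompt but with randomized channel statistics. How the paper's model actually applies such a perturbation to the style prompt is not specified in this summary.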
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Adaptive Instance Normalization · Diffusion · Latent Diffusion Model · Focus · Instance Normalization
