DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion
Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee

TL;DR
This paper introduces DDDM-VC, a voice conversion method that uses decoupled denoising diffusion models with disentangled speech representations and prior mixup to control the style of each speech attribute and to improve robustness.
Contribution
The paper proposes a decoupled diffusion framework that combines disentangled speech representations with a prior mixup technique, aiming at more controllable and more robust voice conversion.
Findings
Outperforms existing VC models in quality.
Provides robust performance across different model sizes.
Enables precise control over speech attributes.
Abstract
Diffusion-based generative models have exhibited powerful generative performance in recent years. However, as many attributes exist in the data distribution and owing to several limitations of sharing the model parameters across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address the above problem, this paper presents decoupled denoising diffusion models (DDDMs) with disentangled representations, which can control the style for each attribute in generative models. We apply DDDMs to voice conversion (VC) tasks to address the challenges of disentangling and controlling each speech attribute (e.g., linguistic information, intonation, and timbre). First, we use a self-supervised representation to disentangle the speech representation. Subsequently, the DDDMs are applied to resynthesize the speech from the disentangled…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion · Mixup
