Voice Conversion with Denoising Diffusion Probabilistic GAN Models
Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

TL;DR
This paper introduces DiffGAN-VC, a novel model combining GANs and diffusion models for non-parallel, many-to-many voice conversion, achieving high quality and diversity in generated speech.
Contribution
It proposes a new DiffGAN-VC model that integrates large-step denoising diffusion with multimodal conditional GANs for improved voice conversion.
Findings
DiffGAN-VC outperforms CycleGAN-VC in voice quality and speaker similarity.
The model produces natural and diverse speech when trained on non-parallel datasets.
Objective and subjective evaluations confirm the effectiveness of DiffGAN-VC.
Abstract
Voice conversion transforms speaking style while preserving linguistic content. Many researchers use deep generative models for voice conversion. Generative Adversarial Networks (GANs) can quickly generate high-quality samples, but the generated samples lack diversity. Samples generated by Denoising Diffusion Probabilistic Models (DDPMs) are better than those of GANs in terms of mode coverage and sample diversity. However, DDPMs have high computational costs, and their inference is slower than that of GANs. To make GANs and DDPMs more practical, we propose DiffGAN-VC, a variant of GANs and DDPMs, to achieve non-parallel many-to-many voice conversion (VC). We use large steps to achieve denoising, and also introduce a multimodal conditional GAN to model the denoising diffusion generative adversarial network.…
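The "large steps" idea in the abstract can be illustrated with a minimal sketch of a forward diffusion process that uses only a handful of steps. This is not the authors' code: the step count (4), the linear schedule, the beta range, and all function names are assumptions chosen for illustration. The sketch shows only the standard DDPM forward process q(x_t | x_0); with so few steps, each reverse step has to remove a lot of noise at once, which is the setting in which the abstract brings in a multimodal conditional GAN as the denoiser.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=0.1, beta_end=0.9):
    """Per-step noise variances. Values are illustrative assumptions:
    they are large because only a few diffusion steps are used."""
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def diffuse(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_beta_schedule(4)          # hypothetical: 4 large steps
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
frame = [0.5, -1.0, 0.25]                # toy stand-in for one acoustic feature frame
noisy = diffuse(frame, 3, alpha_bars, rng)
```

After the last of the four steps, very little of the original signal remains (alpha_bar is roughly 0.02 under these assumed values), so the whole trajectory from data to near-pure noise is covered in a few jumps rather than hundreds or thousands of small ones.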
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
