Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS
Myeongjin Ko, Yong-Hoon Choi

TL;DR
This paper introduces an adversarial training approach for diffusion-based multi-speaker TTS that uses dual discriminators, improving speech quality and synthesis speed over existing models such as FastSpeech2 and DiffGAN-TTS.
Contribution
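It proposes a diffusion speech synthesis model with two discriminators: a diffusion discriminator that learns the distribution of the reverse process, and a second discriminator that learns the distribution of the generated data. The paper reports that this combination improves on DiffGAN-TTS.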
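As a rough illustration of how a generator can be trained against two discriminators at once, the sketch below combines two adversarial terms into one generator loss. It is not the authors' implementation: the least-squares GAN objective, the function names, and the `data_weight` parameter are all assumptions chosen for clarity, and the discriminator scores are plain lists of floats rather than network outputs.

```python
from typing import Sequence


def mean_sq(scores: Sequence[float], target: float) -> float:
    """Mean squared distance of discriminator scores from a target label."""
    return sum((s - target) ** 2 for s in scores) / len(scores)


def discriminator_loss(real_scores: Sequence[float],
                       fake_scores: Sequence[float]) -> float:
    # Assumed least-squares GAN objective: push scores on real samples
    # toward 1 and scores on generated samples toward 0.
    return mean_sq(real_scores, 1.0) + mean_sq(fake_scores, 0.0)


def generator_loss(diffusion_fake_scores: Sequence[float],
                   data_fake_scores: Sequence[float],
                   data_weight: float = 1.0) -> float:
    # The generator is rewarded when BOTH discriminators score its output
    # as real (close to 1). `data_weight` is a hypothetical balancing
    # factor between the two terms, not a value taken from the paper.
    diffusion_term = mean_sq(diffusion_fake_scores, 1.0)
    data_term = mean_sq(data_fake_scores, 1.0)
    return diffusion_term + data_weight * data_term


if __name__ == "__main__":
    # Made-up scores: the diffusion discriminator is fooled more often
    # than the generated-data discriminator in this toy batch.
    print(generator_loss([0.9, 0.8], [0.4, 0.5]))
```

In a setup like this, each discriminator would be updated with its own `discriminator_loss`, while the generator minimizes the summed `generator_loss`, so a sample has to satisfy both critics to lower the loss.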
Findings
Outperforms state-of-the-art models like FastSpeech2 and DiffGAN-TTS in multiple metrics.
Achieves higher speech quality and faster synthesis than those baselines.
Demonstrates effectiveness through both objective and subjective evaluations.
Abstract
The diffusion model is capable of generating high-quality data through a probabilistic approach. However, it suffers from the drawback of slow generation speed due to the requirement of a large number of time steps. To address this limitation, recent models such as denoising diffusion implicit models (DDIM) focus on generating samples without directly modeling the probability distribution, while models like denoising diffusion generative adversarial networks (GAN) combine diffusion processes with GANs. In the field of speech synthesis, a recent diffusion speech synthesis model called DiffGAN-TTS, utilizing the structure of GANs, has been introduced and demonstrates superior performance in both speech quality and generation speed. In this paper, to further enhance the performance of DiffGAN-TTS, we propose a speech synthesis model with two discriminators: a diffusion discriminator for…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Diffusion · Focus · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
