Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS

Myeongjin Ko; Yong-Hoon Choi

arXiv:2308.01573·cs.SD·April 30, 2024

TL;DR

This paper introduces an adversarial training approach for diffusion-based multi-speaker TTS that uses two discriminators, and reports higher speech quality and synthesis speed than existing models.

Contribution

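The paper proposes a diffusion-based speech synthesis model trained with two discriminators. The discriminators are intended to improve how well the model learns the distribution of the reverse process and the distribution of the generated data, which the authors report leads to better performance.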
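As a rough illustration of how a generator objective might combine feedback from two discriminators, the sketch below sums two least-squares GAN terms. The function names, the least-squares form of the loss, and the equal default weights are assumptions made for illustration; they are not details taken from the paper or its repository.

```python
from typing import Sequence


def lsgan_generator_loss(fake_scores: Sequence[float]) -> float:
    """Least-squares generator loss: pushes scores on generated samples toward 1."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


def lsgan_discriminator_loss(
    real_scores: Sequence[float], fake_scores: Sequence[float]
) -> float:
    """Least-squares discriminator loss: real scores toward 1, fake scores toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def dual_generator_loss(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    weight_a: float = 1.0,  # hypothetical weighting, not from the paper
    weight_b: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of the generator losses from two separate discriminators."""
    return weight_a * lsgan_generator_loss(scores_a) + weight_b * lsgan_generator_loss(
        scores_b
    )
```

In this sketch each discriminator scores the generator's output independently, and the generator is penalized whenever either one can tell its samples from real data. In practice the scores would come from neural networks and the losses would be computed on tensors.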

Findings

01. Outperforms state-of-the-art models like FastSpeech2 and DiffGAN-TTS in multiple metrics.

02. Achieves higher speech quality and synthesis speed.

03. Demonstrates effectiveness through both objective and subjective evaluations.

Abstract

The diffusion model is capable of generating high-quality data through a probabilistic approach. However, it suffers from the drawback of slow generation speed due to the requirement of a large number of time steps. To address this limitation, recent models such as denoising diffusion implicit models (DDIM) focus on generating samples without directly modeling the probability distribution, while models like denoising diffusion generative adversarial networks (GAN) combine diffusion processes with GANs. In the field of speech synthesis, a recent diffusion speech synthesis model called DiffGAN-TTS, utilizing the structure of GANs, has been introduced and demonstrates superior performance in both speech quality and generation speed. In this paper, to further enhance the performance of DiffGAN-TTS, we propose a speech synthesis model with two discriminators: a diffusion discriminator for…
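The link between step count and generation cost can be made concrete with a small sketch of the forward noising schedule. It assumes a linear variance schedule with illustrative endpoints (1e-4 to 0.02) and step counts of 1000 and 4; none of these values come from the paper. The quantity computed is the cumulative product of (1 - beta_t), which is the fraction of the original signal variance still present after t noising steps.

```python
from typing import List


def linear_betas(
    num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02
) -> List[float]:
    """Linearly spaced noise variances (illustrative schedule, assumed here)."""
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """Running product of (1 - beta_t): signal variance remaining after each step."""
    remaining = 1.0
    history = []
    for beta in betas:
        remaining *= 1.0 - beta
        history.append(remaining)
    return history


# Sampling needs one network evaluation per reverse step, so cost grows with
# the number of steps in the chain.
long_chain = cumulative_alphas(linear_betas(1000))
short_chain = cumulative_alphas(linear_betas(4))

print(f"1000 steps: remaining signal variance = {long_chain[-1]:.2e}")
print(f"   4 steps: remaining signal variance = {short_chain[-1]:.2e}")
```

With the same small per-step variances, the 4-step chain barely corrupts the signal, while the 1000-step chain reaches almost pure noise. A model that samples in only a few steps therefore has to take much larger denoising jumps per step, which is the setting in which GAN-based diffusion models are typically applied.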

Code & Models

Repositories

komyeongjin/specdiff-gan (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Diffusion · Focus · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings