GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models

Matthew Baas; Herman Kamper

arXiv:2210.05271·cs.SD·October 12, 2022

TL;DR

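This paper introduces AudioStyleGAN (ASGAN), a GAN-based model for fast, disentangled unconditional speech synthesis. It achieves state-of-the-art results on the Google Speech Commands dataset, runs substantially faster than the top-performing diffusion models, and supports voice conversion and speech editing without being explicitly trained for either.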

Contribution

We develop ASGAN, a new GAN architecture for unconditional speech synthesis. It is trained with several new techniques, including a modification to adaptive discriminator augmentation, achieves state-of-the-art results, and enables voice manipulation tasks.

Findings

01

ASGAN achieves state-of-the-art results in unconditional speech synthesis on the Google Speech Commands dataset.

02

ASGAN is substantially faster than the top-performing diffusion models.

03

ASGAN can perform voice conversion and speech editing without being explicitly trained to do so.

Abstract

We propose AudioStyleGAN (ASGAN), a new generative adversarial network (GAN) for unconditional speech synthesis. As in the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation to probabilistically skip discriminator updates. ASGAN achieves state-of-the-art results in unconditional speech synthesis on the Google Speech Commands dataset. It is also substantially faster than the top-performing diffusion models. Through a design that encourages disentanglement, ASGAN is able to perform voice conversion and speech editing without being explicitly trained to do so. ASGAN demonstrates that GANs are still highly…
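The abstract mentions a modification to adaptive discriminator augmentation that probabilistically skips discriminator updates, but gives no further detail. The following is only a minimal sketch of what such a rule could look like, not the authors' implementation. It assumes a skip probability `p` that is nudged up or down depending on whether some overfitting signal exceeds a target, in the spirit of how adaptive discriminator augmentation adjusts its augmentation probability. The class name `SkipController`, the parameters `target` and `step`, and the constant placeholder signal are all hypothetical.

```python
import random


class SkipController:
    """Hypothetical controller that tracks a skip probability p in [0, 1]."""

    def __init__(self, target=0.6, step=0.01, seed=0):
        self.p = 0.0
        self.target = target  # assumed target value for the overfitting signal
        self.step = step      # assumed adjustment size per update
        self.rng = random.Random(seed)

    def update(self, overfit_signal):
        # Raise p when the discriminator looks too strong, lower it otherwise,
        # then clamp p to the valid probability range.
        direction = 1.0 if overfit_signal > self.target else -1.0
        self.p = min(1.0, max(0.0, self.p + direction * self.step))

    def should_skip(self):
        # Skip the discriminator update with probability p.
        return self.rng.random() < self.p


def run(steps=1000, overfit_signal=0.9):
    """Toy loop: counts how many discriminator updates would be skipped."""
    ctrl = SkipController()
    skipped = 0
    for _ in range(steps):
        # In real training the signal would be measured from discriminator
        # outputs; here it is a constant placeholder (an assumption).
        ctrl.update(overfit_signal)
        if ctrl.should_skip():
            skipped += 1
            continue
        # A discriminator optimiser step would go here.
    return ctrl.p, skipped
```

In this sketch, a discriminator that stays ahead of the generator is updated less and less often, which gives the generator room to catch up. How ASGAN actually measures the signal and sets the probability is not stated in the text above.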

Code & Models

Repositories

rf5/simple-asgan
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: StyleGAN · Dense Connections · Feedforward Network · Convolution · Adaptive Instance Normalization · R1 Regularization · Diffusion