GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models
Matthew Baas, Herman Kamper

TL;DR
This paper introduces AudioStyleGAN (ASGAN), a GAN-based model for high-quality, fast, and disentangled unconditional speech synthesis, outperforming diffusion models and enabling voice conversion and speech editing.
Contribution
We develop ASGAN, a novel GAN architecture for speech synthesis that incorporates new training techniques and achieves state-of-the-art results while enabling voice manipulation tasks.
Findings
ASGAN achieves state-of-the-art results in unconditional speech synthesis on the Google Speech Commands dataset.
ASGAN is substantially faster than the top-performing diffusion models.
ASGAN performs voice conversion and speech editing without being explicitly trained for either task.
Abstract
We propose AudioStyleGAN (ASGAN), a new generative adversarial network (GAN) for unconditional speech synthesis. As in the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation to probabilistically skip discriminator updates. ASGAN achieves state-of-the-art results in unconditional speech synthesis on the Google Speech Commands dataset. It is also substantially faster than the top-performing diffusion models. Through a design that encourages disentanglement, ASGAN is able to perform voice conversion and speech editing without being explicitly trained to do so. ASGAN demonstrates that GANs are still highly…
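The abstract says only that adaptive discriminator augmentation (ADA) is modified to probabilistically skip discriminator updates. A minimal sketch of that idea follows. Everything specific in it is an assumption for illustration and is not taken from the paper: the controller rule, the `target` and `step` values, and the function names `update_skip_probability`, `should_skip_discriminator_update`, and `simulate` are all hypothetical.

```python
import random


def update_skip_probability(p, overfit_signal, target=0.6, step=0.01):
    """ADA-style controller (assumed form, not the paper's exact rule).

    Nudges the probability p up when the discriminator looks overconfident
    (overfit_signal above target) and down otherwise, clamped to [0, 1].
    """
    if overfit_signal > target:
        p += step
    else:
        p -= step
    return min(max(p, 0.0), 1.0)


def should_skip_discriminator_update(p, rng):
    """Return True with probability p, meaning: skip this discriminator step."""
    return rng.random() < p


def simulate(num_steps, p, seed=0):
    """Count how many discriminator updates are skipped at a fixed p."""
    rng = random.Random(seed)
    skipped = 0
    for _ in range(num_steps):
        if should_skip_discriminator_update(p, rng):
            skipped += 1
            continue
        # A real training loop would run the discriminator optimizer step here.
    return skipped
```

In a real training loop the generator would still be updated on every iteration, so a skipped step only slows the discriminator down; how the paper sets or schedules the skip probability is not stated in the abstract.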
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: StyleGAN · Dense Connections · Feedforward Network · Convolution · Adaptive Instance Normalization · R1 Regularization · Diffusion
