Disentanglement in a GAN for Unconditional Speech Synthesis
Matthew Baas, Herman Kamper

TL;DR
This paper introduces AudioStyleGAN (ASGAN), a GAN for unconditional speech synthesis with a disentangled latent space. It achieves state-of-the-art results on the small-vocabulary Google Speech Commands dataset, and its latent space supports tasks such as voice conversion and speech enhancement.
Contribution
The paper presents ASGAN, a novel GAN architecture for unconditional speech synthesis with a disentangled latent space, together with new training techniques. The model is reported to surpass prior approaches in synthesis quality and to run faster than existing diffusion models.
Findings
Achieves state-of-the-art results on Google Speech Commands dataset.
Faster than existing diffusion models for speech synthesis.
Disentangled latent space enables tasks like voice conversion and speech enhancement.
Abstract
Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it on the…
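The idea of probabilistically skipping discriminator updates, mentioned in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation: the class `SkipController`, its parameters `target` and `step`, and the fixed-step update rule are assumptions for illustration. The overfitting heuristic (mean sign of the discriminator's outputs on real data) follows the one used in standard adaptive discriminator augmentation.

```python
import random


class SkipController:
    """Hypothetical helper that tracks a probability p of skipping a
    discriminator update. Names and defaults are illustrative assumptions."""

    def __init__(self, target=0.6, step=0.01, seed=None):
        self.p = 0.0            # current skip probability
        self.target = target    # desired value of the overfitting heuristic
        self.step = step        # fixed adjustment applied per call (assumed)
        self.rng = random.Random(seed)

    def update(self, real_logits):
        """Adjust p from discriminator logits on a non-empty batch of real data."""
        # ADA-style heuristic: mean sign of the logits on real samples.
        signs = [1.0 if x > 0 else -1.0 if x < 0 else 0.0 for x in real_logits]
        r = sum(signs) / len(signs)
        # A discriminator that is too confident raises p; otherwise p decays.
        self.p += self.step if r > self.target else -self.step
        self.p = min(1.0, max(0.0, self.p))
        return r

    def should_skip(self):
        """Return True when this discriminator step should be skipped."""
        return self.rng.random() < self.p


# Usage inside a training loop (the discriminator call is a stand-in).
ctrl = SkipController(step=0.05, seed=0)
for _ in range(10):
    fake_real_logits = [0.8, 1.2, 0.3, 2.0]   # placeholder for D(real batch)
    ctrl.update(fake_real_logits)
    if not ctrl.should_skip():
        pass  # the discriminator optimizer step would run here
```

Under these assumptions, a discriminator that becomes overconfident on real data is updated less often, which gives the generator time to catch up without changing the data the discriminator sees.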
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Dense Connections · Adaptive Instance Normalization · Convolution · R1 Regularization · Feedforward Network · StyleGAN · Diffusion
