Disentanglement in a GAN for Unconditional Speech Synthesis

Matthew Baas; Herman Kamper

arXiv:2307.01673 · eess.AS · January 26, 2024 · 1 citation

TL;DR

This paper introduces AudioStyleGAN (ASGAN), a GAN model for unconditional speech synthesis with a disentangled latent space, achieving state-of-the-art results on small-vocabulary datasets and enabling various speech tasks.

Contribution

The paper presents ASGAN, a novel GAN architecture for unconditional speech synthesis with a disentangled latent space and new training techniques, outperforming existing models in speed and quality.

Findings

01 Achieves state-of-the-art results on the Google Speech Commands dataset.

02 Generates speech faster than existing diffusion models for speech synthesis.

03 Its disentangled latent space enables tasks such as voice conversion and speech enhancement.

Abstract

Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it on the…
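The abstract only states that the modified adaptive discriminator augmentation "probabilistically skips discriminator updates"; it gives no details. The following is a minimal sketch of how such a scheme could look, assuming the skip probability is steered by the overfitting heuristic of the original ADA method (the mean sign of the discriminator's outputs on real data, compared against a target). The class name `SkipController`, its parameters, and the default values are hypothetical and are not taken from the paper or its repository.

```python
import random


class SkipController:
    """Hypothetical controller for probabilistically skipping discriminator updates.

    Assumption: the skip probability p is adjusted with an ADA-style heuristic,
    r_t = mean(sign(D(real))). The paper's actual rule may differ.
    """

    def __init__(self, target=0.6, step=0.01, seed=None):
        self.p = 0.0          # current probability of skipping an update
        self.target = target  # desired value of the overfitting heuristic
        self.step = step      # how far p moves per adjustment
        self.rng = random.Random(seed)

    def update(self, real_logits):
        """Adjust p from discriminator outputs on a batch of real samples."""
        signs = [1.0 if x > 0 else -1.0 for x in real_logits]
        r_t = sum(signs) / len(signs)
        # Discriminator too confident on real data: skip more often.
        direction = 1.0 if r_t > self.target else -1.0
        self.p = min(1.0, max(0.0, self.p + direction * self.step))
        return r_t

    def should_skip(self):
        """Return True with probability p."""
        return self.rng.random() < self.p


def train_step(controller, real_logits, update_discriminator, update_generator):
    """One illustrative step: the generator always trains, the discriminator may not."""
    controller.update(real_logits)
    if not controller.should_skip():
        update_discriminator()
    update_generator()
```

In a training loop, `train_step` would be called once per batch with callables that perform the actual optimizer steps. Skipping updates slows the discriminator down relative to the generator, which is one plausible way to keep it from overpowering the generator without augmenting the audio features themselves.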

Code & Models

Repositories

rf5/simple-asgan (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Dense Connections · Adaptive Instance Normalization · Convolution · R1 Regularization · Feedforward Network · StyleGAN · Diffusion