Chunked Autoregressive GAN for Conditional Waveform Synthesis

Max Morrison, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, and Yoshua Bengio

arXiv:2110.10139 · eess.AS · March 7, 2022 · 6 cites

TL;DR

This paper introduces CARGAN, a novel chunked autoregressive GAN for conditional waveform synthesis that significantly reduces pitch errors and training time while maintaining high-quality, real-time audio generation.

Contribution

The paper proposes CARGAN, a chunked autoregressive GAN that improves pitch accuracy and training efficiency in waveform synthesis compared to existing GAN models.

Findings

01. Reduces pitch error by 40-60%.
02. Decreases training time by 58%.
03. Maintains real-time generation speed.

Abstract

Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond with an inability for the generator to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and…
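To make the distinction between sequential and parallel sampling concrete, the following is a minimal sketch of chunked autoregressive generation in plain Python. It is not the CARGAN implementation: the function names, the `frames_per_chunk` and `context_samples` parameters, and the toy generator (a running sum standing in for a neural network) are all assumptions made for illustration. The only idea it is meant to show is that the waveform is produced one chunk at a time, with all samples inside a chunk produced in one call, and with each call conditioned on both the current conditioning frames and the most recent previously generated samples.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]
ChunkGenerator = Callable[[Sequence[Frame], Sequence[float]], List[float]]


def chunked_autoregressive_synthesis(
    frames: Sequence[Frame],
    generate_chunk: ChunkGenerator,
    frames_per_chunk: int,
    context_samples: int,
) -> List[float]:
    """Generate a waveform chunk by chunk (hypothetical helper).

    Each call to `generate_chunk` sees the conditioning frames for the
    current chunk plus the last `context_samples` generated samples.
    """
    waveform: List[float] = []
    for start in range(0, len(frames), frames_per_chunk):
        chunk_frames = frames[start:start + frames_per_chunk]
        # Take the most recent samples; left-pad with zeros at the start.
        context = waveform[-context_samples:] if context_samples > 0 else []
        context = [0.0] * (context_samples - len(context)) + list(context)
        waveform.extend(generate_chunk(chunk_frames, context))
    return waveform


def toy_generate_chunk(
    chunk_frames: Sequence[Frame], context: Sequence[float], hop: int = 2
) -> List[float]:
    """Toy stand-in for a generator network (assumption, not a real model).

    Emits `hop` samples per frame. Each sample adds the frame's first value
    to the previous sample, so the output depends on the autoregressive
    context, not only on the conditioning frames.
    """
    previous = context[-1] if context else 0.0
    samples: List[float] = []
    for frame in chunk_frames:
        for _ in range(hop):
            previous = previous + frame[0]
            samples.append(previous)
    return samples


frames = [[1.0], [1.0], [2.0], [2.0]]
chunked = chunked_autoregressive_synthesis(
    frames, toy_generate_chunk, frames_per_chunk=2, context_samples=1
)
single_pass = chunked_autoregressive_synthesis(
    frames, toy_generate_chunk, frames_per_chunk=4, context_samples=1
)
```

In this toy setting the chunked output matches the single-pass output because the context carries the running state across the chunk boundary. With `context_samples=0` the second chunk would restart from zero, which is a rough analogue of a generator that has no access to previously generated audio.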

Code & Models

Repositories

descriptinc/cargan (PyTorch, official)

Videos

Chunked Autoregressive GAN for Conditional Waveform Synthesis · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings