Chunked Autoregressive GAN for Conditional Waveform Synthesis
Max Morrison, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, and Yoshua Bengio

TL;DR
This paper introduces CARGAN, a chunked autoregressive GAN for conditional waveform synthesis that reduces pitch error by 40-60% and training time by 58% while maintaining high-quality, real-time audio generation.
Contribution
The paper proposes CARGAN, a chunked autoregressive GAN that improves pitch accuracy and training efficiency in waveform synthesis compared to existing GAN models.
Findings
Reduces pitch error by 40-60%.
Decreases training time by 58%.
Maintains real-time generation speed.
Abstract
Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond with an inability for the generator to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and…
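The chunked autoregressive idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_generator` is a hypothetical stand-in for a trained GAN generator, and the values of `chunk_frames`, `context_samples`, and `hop_length` are arbitrary assumptions chosen to keep the example small. The sketch only shows the control flow: each chunk of the waveform is produced in one parallel pass, conditioned on its mel-spectrogram frames and on the tail of the audio generated so far.

```python
import numpy as np


def toy_generator(mel_chunk, prev_samples, hop_length):
    """Hypothetical stand-in for a trained generator (an assumption).

    Returns mel_chunk.shape[1] * hop_length samples that depend on both
    the conditioning frames and the previously generated samples.
    """
    # Per-sample value derived from the mean energy of each mel frame.
    frame_energy = np.repeat(mel_chunk.mean(axis=0), hop_length)
    # Mix in the last previous sample so the autoregressive dependence
    # on earlier output is visible in the result.
    return 0.5 * prev_samples[-1] + 0.01 * frame_energy


def chunked_autoregressive_synthesis(mel, generator, chunk_frames=4,
                                     context_samples=8, hop_length=2):
    """Generate audio chunk by chunk from a (n_mels, n_frames) array."""
    n_frames = mel.shape[1]
    # Zero-valued context stands in for "previous audio" before chunk 0.
    audio = np.zeros(context_samples)
    for start in range(0, n_frames, chunk_frames):
        mel_chunk = mel[:, start:start + chunk_frames]
        prev_samples = audio[-context_samples:]
        # Samples within a chunk are produced together (non-autoregressive);
        # autoregression happens only across chunk boundaries.
        chunk = generator(mel_chunk, prev_samples, hop_length)
        audio = np.concatenate([audio, chunk])
    # Drop the artificial zero context.
    return audio[context_samples:]


if __name__ == "__main__":
    mel = np.ones((80, 10))  # 80 mel bins, 10 frames (illustrative sizes)
    waveform = chunked_autoregressive_synthesis(mel, toy_generator)
    print(waveform.shape)
```

Larger chunks mean fewer sequential steps per second of audio, which is one plausible reading of how such a model could keep real-time generation speed while still conditioning on past output.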
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
