RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
Antoine Caillon, Philippe Esling

TL;DR
RAVE is a variational autoencoder for fast, high-quality audio synthesis at 48 kHz. It offers controllable generation, supports applications such as timbre transfer and signal compression, and is reported to outperform existing models in both speed and synthesis quality.
Contribution
Introduces RAVE, a VAE trained in two stages (representation learning followed by adversarial fine-tuning) that achieves faster-than-real-time, high-fidelity audio synthesis at 48 kHz, with a controllable latent space and a multi-band decomposition of the waveform.
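To make the idea of a multi-band waveform decomposition concrete, here is a minimal sketch. It is not the filter bank used in the paper, whose details are not given in this summary. It uses a two-band Haar split as a stand-in, and the function names `analyze` and `synthesize` are hypothetical. The point it illustrates is that each band runs at a fraction of the original sample rate, so a model working on the bands handles shorter sequences, while the original waveform can still be recovered.

```python
import math


def analyze(signal):
    """Split an even-length signal into a low band and a high band.

    Each band has half as many samples as the input (toy Haar filter bank,
    used here only as an illustration, not the paper's decomposition).
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1 / math.sqrt(2)
    low = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    high = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return low, high


def synthesize(low, high):
    """Invert `analyze` by recovering and interleaving even and odd samples."""
    s = 1 / math.sqrt(2)
    out = []
    for l, h in zip(low, high):
        out.append(s * (l + h))  # even-indexed sample
        out.append(s * (l - h))  # odd-indexed sample
    return out


if __name__ == "__main__":
    # Toy input: 8 samples of a 6 kHz sine sampled at 48 kHz.
    x = [math.sin(2 * math.pi * 6000 * n / 48000) for n in range(8)]
    low, high = analyze(x)
    y = synthesize(low, high)
    print(len(low), len(high))                      # samples per band
    print(max(abs(a - b) for a, b in zip(x, y)))    # reconstruction error
```

With two bands, each band carries half the samples of the input. Applying the same idea with more bands reduces the per-band rate further, which is one plausible reason such decompositions help with synthesis speed.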
Findings
Generates 48 kHz audio 20 times faster than real time on a standard laptop CPU.
Outperforms existing models in synthesis quality, based on both quantitative and subjective evaluations.
Enables applications like timbre transfer and signal compression.
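As a quick worked example of what the speed finding means in practice, the sketch below computes a real-time factor, taken here as audio duration divided by the wall-clock time spent generating it. The helper names `real_time_factor` and `time_budget_seconds` are hypothetical, and the 50 ms timing is an illustrative number, not a measurement from the paper.

```python
SAMPLE_RATE_HZ = 48_000  # sampling rate stated in the summary


def real_time_factor(num_samples: int, synthesis_seconds: float,
                     sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Return audio duration divided by the wall-clock time spent generating it."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    audio_seconds = num_samples / sample_rate_hz
    return audio_seconds / synthesis_seconds


def time_budget_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time available to generate `audio_seconds` at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    # Illustrative timing: one second of 48 kHz audio (48,000 samples)
    # generated in 50 ms of wall-clock time.
    print(real_time_factor(48_000, 0.05))
    # Time budget per second of audio at a 20x speedup.
    print(time_budget_seconds(1.0, 20.0))
```

At a factor of 20, each second of audio has to be produced in about 50 ms, which means roughly 960,000 output samples per second of compute at 48 kHz.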
Abstract
Deep generative models applied to audio have improved by a large margin the state-of-the-art in many speech and music related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows a direct control between the…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
