RAVE: A variational autoencoder for fast and high-quality neural audio synthesis

Antoine Caillon; Philippe Esling

arXiv:2111.05011 · cs.LG · December 16, 2021 · 5 cites

TL;DR

RAVE is a variational autoencoder for fast, high-quality 48kHz audio synthesis. It offers controllable generation, supports applications in timbre transfer and compression, and outperforms existing models in speed and quality.

Contribution

Introduces RAVE, a VAE trained in two stages that achieves real-time, high-fidelity audio synthesis at 48kHz, with a controllable latent space and a novel multi-band waveform decomposition.

Findings

01 Generates 48kHz audio at 20x real-time speed on a CPU.

02 Outperforms existing models in synthesis quality.

03 Enables applications like timbre transfer and signal compression.
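
To make the first finding concrete, the arithmetic behind "20x real-time at 48kHz" can be sketched as below. This is only an illustration of the stated figures; the helper names `real_time_factor` and `samples_per_second` are hypothetical and not part of RAVE or its code.

```python
# Illustrative arithmetic only: the 48 kHz sample rate and the 20x speed
# come from the listing above; the helper names are hypothetical.

def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock compute."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


def samples_per_second(sample_rate_hz: int, rtf: float) -> float:
    """Output samples generated per wall-clock second at a given real-time factor."""
    return sample_rate_hz * rtf


SAMPLE_RATE_HZ = 48_000  # 48 kHz output
RTF = 20.0               # 20x faster than real time

# Samples the model must emit per wall-clock second to sustain 20x real time.
throughput = samples_per_second(SAMPLE_RATE_HZ, RTF)

# Wall-clock time needed to synthesize 10 seconds of audio at that speed.
compute_for_10s = 10.0 / RTF

print(f"{throughput:.0f} samples/s, {compute_for_10s} s of compute per 10 s of audio")
```

Under these figures, a real-time factor above 1 means synthesis keeps ahead of playback; at 20x, ten seconds of audio would take about half a second to generate.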

Abstract

Deep generative models applied to audio have improved by a large margin the state-of-the-art in many speech and music related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows a direct control between the…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis