A Generative-First Neural Audio Autoencoder

Jonah Casebeer; Ge Zhu; Zhepei Wang; Nicholas J. Bryan

arXiv:2602.15749 · cs.SD · February 23, 2026

TL;DR

This paper presents a generative-first neural audio autoencoder that encodes 10x faster, lowers latent rates by 1.6x, and unifies continuous and discrete representations and common audio channel formats in a single model, making audio generative modeling more efficient and versatile.

Contribution

It introduces an architecture that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete latents and common audio channel formats in one model, balancing compression, quality, and speed.

Findings

01. 10x faster encoding compared to previous methods
02. 1.6x lower latent rates while maintaining quality
03. Unified model for various audio formats

Abstract

Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding, 1.6x lower rates, and eliminates channel-format-specific variants while maintaining competitive reconstruction quality.…
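
The downsampling figures in the abstract can be turned into latent frame rates with a few lines of arithmetic. The sketch below is illustrative only: the 44.1 kHz sample rate and the helper name `latent_frame_rate` are assumptions, not details from the abstract. The resulting ratio, 3360 / 2048, is roughly 1.64, which is consistent with the reported 1.6x lower rates, although the abstract does not say that the figure is derived this way.

```python
def latent_frame_rate(sample_rate_hz: float, downsampling: int) -> float:
    """Latent frames per second for a given temporal downsampling factor."""
    return sample_rate_hz / downsampling


# Assumption: the abstract does not state a sample rate; 44.1 kHz is illustrative.
SAMPLE_RATE_HZ = 44_100

old_rate = latent_frame_rate(SAMPLE_RATE_HZ, 2048)  # prior downsampling factor
new_rate = latent_frame_rate(SAMPLE_RATE_HZ, 3360)  # downsampling factor in this paper

# The ratio equals 3360 / 2048 and does not depend on the sample rate.
reduction = old_rate / new_rate

print(f"{old_rate:.2f} -> {new_rate:.2f} latent frames/s ({reduction:.2f}x fewer)")
```

Because the sample rate cancels in the ratio, the same reduction in frames per second holds for any input rate; only the absolute frame rates change.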

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech Recognition and Synthesis