Learning to Upsample and Upmix Audio in the Latent Domain
Dimitrios Bralios, Paris Smaragdis, Jonah Casebeer

TL;DR
This paper introduces a framework that performs audio upsampling and upmixing directly in the latent space of neural autoencoders, reporting efficiency gains of up to 100x while keeping quality comparable to raw-audio processing.
Contribution
It proposes a latent domain processing approach that simplifies training and reduces computational costs for audio enhancement tasks.
Findings
Achieves up to 100x computational efficiency gains.
Maintains audio quality comparable to raw audio processing.
Validates the approach on two tasks: bandwidth extension and mono-to-stereo upmixing.
Abstract
Neural audio autoencoders create compact latent representations that preserve perceptually important information, serving as the foundation for both modern audio compression systems and generation approaches like next-token prediction and latent diffusion. Despite their prevalence, most audio processing operations, such as spatial and spectral up-sampling, still inefficiently operate on raw waveforms or spectral representations rather than directly on these compressed representations. We propose a framework that performs audio processing operations entirely within an autoencoder's latent space, eliminating the need to decode to raw audio formats. Our approach dramatically simplifies training by operating solely in the latent domain, with a latent L1 reconstruction term, augmented by a single latent adversarial discriminator. This contrasts sharply with raw-audio methods that typically…
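The training objective described in the abstract (a latent L1 reconstruction term plus a single latent adversarial discriminator) can be illustrated with a minimal sketch. The abstract does not give the adversarial formulation or the loss weighting, so the hinge form, the `adv_weight` value, and all function names below are assumptions made for illustration, not the authors' implementation. Latents are represented as flat lists of floats to keep the example dependency-free.

```python
from typing import Sequence


def latent_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and target latents."""
    if len(pred) != len(target):
        raise ValueError("latents must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def hinge_generator_loss(fake_scores: Sequence[float]) -> float:
    """Generator side of a hinge GAN loss (assumed form).

    Lower is better: the latent processor is rewarded when the
    discriminator assigns high scores to its predicted latents.
    """
    return -sum(fake_scores) / len(fake_scores)


def hinge_discriminator_loss(
    real_scores: Sequence[float], fake_scores: Sequence[float]
) -> float:
    """Discriminator side of a hinge GAN loss (assumed form)."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def total_generator_loss(
    pred: Sequence[float],
    target: Sequence[float],
    fake_scores: Sequence[float],
    adv_weight: float = 0.1,  # hypothetical weight, not taken from the paper
) -> float:
    """Latent L1 term plus a weighted adversarial term, both computed on latents."""
    return latent_l1(pred, target) + adv_weight * hinge_generator_loss(fake_scores)


if __name__ == "__main__":
    # Toy latents: `pred` stands in for the processed (e.g. upmixed) latent,
    # `target` for the latent of the reference audio.
    pred = [0.5, -1.0, 2.0]
    target = [1.0, -1.0, 1.0]
    # Scores a latent discriminator might assign (made-up numbers).
    fake_scores = [-0.2, 0.4]
    real_scores = [0.8, 1.5]
    print("generator loss:", total_generator_loss(pred, target, fake_scores))
    print("discriminator loss:", hinge_discriminator_loss(real_scores, fake_scores))
```

Every quantity here is computed on latent vectors, so no decoding to waveforms or spectrograms is needed during training, which is the simplification the abstract emphasizes.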
Taxonomy
Topics: Music and Audio Processing
