AEROMamba: An efficient architecture for audio super-resolution using generative adversarial networks and state space models
Wallace Abreu, Luiz Wagner Pereira Biscainho

TL;DR
AEROMamba introduces a novel architecture for audio super-resolution that replaces attention and LSTM modules with a state space model, significantly reducing memory usage and increasing speed while improving subjective audio quality.
Contribution
The paper presents AEROMamba, a new model that replaces traditional modules with a state space model, achieving faster inference, lower memory consumption, and better audio quality in super-resolution tasks.
Findings
14x faster inference compared to AERO
2-4x less GPU memory during training, and 5x less during inference
Superior subjective listening scores on MUSDB and PianoEval datasets
Abstract
Audio super-resolution aims to enhance low-resolution signals by creating high-frequency content. In this work, we modify the architecture of AERO (a state-of-the-art system for this task) for music super-resolution. Specifically, we replace its original Attention and LSTM layers with Mamba, a State Space Model (SSM), across all network layers. Mamba is capable of effectively substituting the mentioned modules, as it offers a mechanism similar to that of Attention while also functioning as a recurrent network. With the proposed AEROMamba, training requires 2-4x less GPU memory, since Mamba exploits the convolutional formulation and leverages GPU memory hierarchy. Additionally, during inference, Mamba operates in constant memory due to recurrence, avoiding memory growth associated with Attention. This results in a 14x speed improvement using 5x less GPU. Subjective listening tests (0 to…
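The abstract's point about an SSM having both a recurrent and a convolutional formulation can be illustrated with a toy example. The sketch below is not Mamba and not the AEROMamba code: it assumes a scalar, time-invariant SSM with fixed parameters `a`, `b`, `c` (hypothetical values chosen for illustration), whereas Mamba's selective SSM makes its parameters input-dependent. It only shows why the recurrent form needs a constant-size state at inference while the same mapping can also be computed as a causal convolution.

```python
# Toy scalar state space model, for illustration only (an assumption, not the paper's model).
# Recurrence: h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t


def ssm_recurrent(x, a, b, c):
    """Step through the sequence keeping only one float of state (constant memory)."""
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t  # update the hidden state in place
        y.append(c * h)      # read out the current output sample
    return y


def ssm_convolutional(x, a, b, c):
    """Compute the same outputs as a causal convolution with kernel k_j = c * a**j * b."""
    n = len(x)
    kernel = [c * (a ** j) * b for j in range(n)]
    return [sum(kernel[j] * x[t - j] for j in range(t + 1)) for t in range(n)]


if __name__ == "__main__":
    signal = [1.0, 0.0, 0.0, 2.0]
    print(ssm_recurrent(signal, a=0.5, b=1.0, c=2.0))
    print(ssm_convolutional(signal, a=0.5, b=1.0, c=2.0))
```

Both functions return the same sequence for the same inputs. The recurrent version stores a single state value regardless of sequence length, which is the property the abstract contrasts with the memory growth of Attention; the convolutional version exposes all time steps at once, which is what makes parallel training possible in the time-invariant case.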
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Aerodynamics and Acoustics in Jet Flows
Methods: Softmax · Attention Is All You Need · Sigmoid Activation · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Tanh Activation · Long Short-Term Memory · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
