HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit

Khushiyant; Param Thakkar

arXiv:2603.21316 · cs.SD · March 24, 2026

TL;DR

HELIX is a hybrid Mamba-attention framework for raw audio understanding. Across six datasets, it shows that the preferred input representation depends on the backbone and that the value of attention depends on sequence length.

Contribution

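The paper presents HELIX, a controlled framework that compares pure Mamba, pure attention, and a minimal hybrid with a single attention bottleneck. All models are parameter-matched at about 8.3M parameters, which isolates how each backbone interacts with input representation and sequence length in audio tasks.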

Findings

01

Attention becomes important at longer sequence lengths but hurts performance on short, stationary audio.

02

Pure attention fails with out-of-memory errors on long sequences (30,000 tokens for 5 minutes of audio).

03

On the 5-minute, 30,000-token speaker identification task, the hybrid closes an 11.5-point gap over pure Mamba, while pure attention cannot run at all.

Abstract

Audio representation learning typically evaluates design choices such as input frontend, sequence backbone, and sequence length in isolation. We show that these axes are coupled, and conclusions from one setting often do not transfer to others. We introduce HELIX, a controlled framework comparing pure Mamba, pure attention, and a minimal hybrid with a single attention bottleneck. All models are parameter-matched at about 8.3M parameters to isolate architectural effects. Across six datasets, we find that the preferred input representation depends on the backbone, and that attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. On a 5-minute speaker identification task with 30,000 tokens, pure attention fails with out-of-memory errors, while HELIX closes an 11.5-point gap over pure Mamba.
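
The out-of-memory result at 30,000 tokens is consistent with simple arithmetic on the quadratic attention score matrix. The sketch below is a back-of-the-envelope estimate, not the paper's measurement: it assumes naive attention that materializes the full score matrix, float32 storage, batch size 1, and a hypothetical head count of 8. None of these settings are stated in the abstract, and fused or memory-efficient attention kernels would avoid storing the full matrix.

```python
def attention_scores_bytes(seq_len, num_heads=1, bytes_per_element=4):
    """Bytes to store one layer's (seq_len x seq_len) score matrix for each head.

    Assumes naive attention with batch size 1; float32 by default.
    """
    return seq_len * seq_len * num_heads * bytes_per_element


def tokens_per_second(num_tokens, duration_seconds):
    """Token rate implied by a clip length and its token count."""
    return num_tokens / duration_seconds


SEQ_LEN = 30_000       # from the abstract
DURATION_S = 5 * 60    # 5-minute clips, from the abstract
ASSUMED_HEADS = 8      # hypothetical, not taken from the paper

rate = tokens_per_second(SEQ_LEN, DURATION_S)
per_head_gb = attention_scores_bytes(SEQ_LEN) / 1e9
all_heads_gb = attention_scores_bytes(SEQ_LEN, num_heads=ASSUMED_HEADS) / 1e9

print(f"token rate: {rate:.0f} tokens/s")
print(f"score matrix, 1 head: {per_head_gb:.1f} GB")
print(f"score matrix, {ASSUMED_HEADS} heads: {all_heads_gb:.1f} GB")

# Doubling the sequence length quadruples the score-matrix memory.
growth = attention_scores_bytes(2 * SEQ_LEN) / attention_scores_bytes(SEQ_LEN)
print(f"memory growth when length doubles: {growth:.0f}x")
```

Under these assumptions a single head already needs 3.6 GB for one layer's scores, before activations, gradients, or additional layers are counted. A state-space backbone such as Mamba scales linearly in sequence length, which is why limiting attention to a single block keeps the hybrid within memory.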

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing