Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations
Sarthak Yadav, Zheng-Hua Tan

TL;DR
Audio Mamba introduces a novel selective state space model for self-supervised learning of general-purpose audio representations, outperforming existing spectrogram transformer baselines across multiple audio recognition tasks.
Contribution
The paper is the first to adapt selective state space models to self-supervised audio representation learning, and it reports better performance than comparable transformer-based baselines.
Findings
Outperforms SSAST baselines on ten audio recognition tasks
Shows robustness across different dataset sizes, sequence lengths, and model sizes
Pretrained on AudioSet, the models outperform the baselines by a considerable margin
Abstract
Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations from randomly masked spectrogram patches through self-supervision. Empirical results on ten diverse audio recognition downstream tasks show that the proposed models, pretrained on the AudioSet dataset, consistently outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by a considerable margin and demonstrate better performance in dataset size, sequence…
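The abstract describes pretraining on randomly masked spectrogram patches. A minimal NumPy sketch of that masking step might look like the following. It is not the authors' implementation: the spectrogram shape, the patch size, the mask ratio, and the helper names patchify and random_mask are all hypothetical choices made for illustration.

```python
import numpy as np

def patchify(spec, patch_f, patch_t):
    """Split a (freq, time) spectrogram into flattened, non-overlapping patches."""
    n_f, n_t = spec.shape[0] // patch_f, spec.shape[1] // patch_t
    spec = spec[: n_f * patch_f, : n_t * patch_t]  # drop any ragged edge
    # (n_f, patch_f, n_t, patch_t) -> (n_f, n_t, patch_f, patch_t)
    patches = spec.reshape(n_f, patch_f, n_t, patch_t).transpose(0, 2, 1, 3)
    return patches.reshape(n_f * n_t, patch_f * patch_t)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector in which True marks a masked patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 200))            # assumed: 80 mel bins x 200 frames
patches = patchify(spec, patch_f=16, patch_t=4)  # assumed patch size
mask = random_mask(len(patches), mask_ratio=0.5, rng=rng)  # assumed mask ratio
visible = patches[~mask]  # patches an encoder would see; the rest are targets
```

In a masked-prediction setup of this kind, the model would be trained to reconstruct the masked patches from the visible ones; how Audio Mamba does this in detail is not stated in the text above.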
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention
