Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung

TL;DR
This paper introduces Audio Mamba (AuM), a self-attention-free state space model for audio classification that matches or exceeds the performance of transformer-based models across six benchmarks.
Contribution
It presents the first purely SSM-based model for audio classification, challenging the necessity of self-attention in this domain.
Findings
AuM performs comparably to or better than Audio Spectrogram Transformer (AST) models.
AuM scales better with sequence length because it avoids the quadratic cost of self-attention.
The model performs well across six diverse audio datasets.
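The scaling claim above can be illustrated with a rough operation count. This is a back-of-the-envelope sketch, not a measurement from the paper: the function names and token counts are hypothetical, and constant factors (hidden size, state size, number of heads) are ignored.

```python
def attention_pairwise_scores(num_tokens: int) -> int:
    """Entries in the token-by-token score matrix of one self-attention head."""
    return num_tokens * num_tokens


def ssm_scan_steps(num_tokens: int, directions: int = 2) -> int:
    """Recurrence steps for a sequential scan (two directions if bidirectional)."""
    return directions * num_tokens


# Hypothetical sequence lengths, chosen only to show the growth rates.
for n in (256, 512, 1024, 2048):
    print(n, attention_pairwise_scores(n), ssm_scan_steps(n))
```

Doubling the number of tokens quadruples the attention count but only doubles the scan count, which is the sense in which a self-attention-free model scales better on long inputs.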
Abstract
Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs. However, Audio Spectrogram Transformers (ASTs) exhibit quadratic scaling due to self-attention. The removal of this quadratic self-attention cost presents an appealing direction. Recently, state space models (SSMs), such as Mamba, have demonstrated potential in language and vision tasks in this regard. In this study, we explore whether reliance on self-attention is necessary for audio classification tasks. By introducing Audio Mamba (AuM), the first self-attention-free, purely SSM-based model for audio classification, we aim to address this question. We evaluate AuM on six different audio benchmarks, where it achieves performance comparable to or better than that of well-established AST models.
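To make the idea of a bidirectional state space model concrete, here is a minimal sketch of a linear recurrence scanned over a sequence in both directions. It is not the authors' implementation. It assumes fixed parameters `a`, `b`, `c` (Mamba makes its parameters input-dependent and uses a hardware-aware scan), a single scalar input channel, and summation to merge the two directions, none of which is specified on this page.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: Sequence[float],
             b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = sum(c * h_t)."""
    h = [0.0] * len(a)  # one hidden value per state dimension
    y = []
    for x_t in x:
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


def bidirectional_ssm(x: Sequence[float], a: Sequence[float],
                      b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Scan left-to-right and right-to-left, then sum (merge rule is an assumption)."""
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# An impulse at the first position, with a single state dimension.
print(bidirectional_ssm([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
# -> [2.0, 0.5, 0.25]
```

Each scan visits every position once, so the cost grows linearly with sequence length. The backward pass lets every output depend on later positions as well as earlier ones, which a single causal scan cannot do.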
Code & Models
- 🤗 Robzy/audiomamba (model) · 3 dl · ♡ 1
- 🤗 saurabhati/DASS_small_AudioSet_47.2 (model) · 2 dl · ♡ 1
- 🤗 saurabhati/DASS_medium_AudioSet_47.6 (model) · 2 dl
- 🤗 saurabhati/DASS_small_AudioSet_48.6 (model) · 10 dl
- 🤗 saurabhati/DASS_medium_AudioSet_48.9 (model)
- 🤗 saurabhati/DASS_small_AudioSet_50.1 (model) · 45 dl
- 🤗 saurabhati/DASS_medium_AudioSet_50.2 (model) · 53 dl · ♡ 2
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
