Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

Mehmet Hamza Erol; Arda Senocak; Jiu Feng; Joon Son Chung

arXiv:2406.03344 · cs.SD · June 6, 2024 · 2 cites

TL;DR

This paper introduces Audio Mamba (AuM), a self-attention-free state space model for audio classification that performs comparably to or better than the transformer-based AST across six benchmarks.

Contribution

It presents the first purely SSM-based model for audio classification, challenging the necessity of self-attention in this domain.

Findings

01 AuM achieves comparable or better performance than AST models.

02 AuM scales better than AST because it avoids the quadratic cost of self-attention.

03 The model performs well across six diverse audio datasets.
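
The scaling claim in finding 02 can be made concrete with simple counting. The following is a minimal sketch, assuming that full self-attention scores every query-key pair and that a state space scan takes one recurrence step per token. The function names are hypothetical, and the counts ignore constant factors, hidden sizes, and hardware effects, so they illustrate growth rates only, not measured results from the paper.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention (grows quadratically)."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens: int) -> int:
    """Recurrence steps in a sequential state space scan (grows linearly)."""
    return num_tokens


if __name__ == "__main__":
    # Token counts are arbitrary illustrative values, not taken from the paper.
    for n in (256, 512, 1024, 2048):
        print(f"tokens={n:5d}  attention_pairs={attention_pair_count(n):8d}  "
              f"scan_steps={scan_step_count(n):5d}")
```

Doubling the number of spectrogram tokens quadruples the attention pair count but only doubles the scan step count, which is the gap the finding refers to.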

Abstract

Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs. However, Audio Spectrogram Transformers (ASTs) exhibit quadratic scaling due to self-attention. The removal of this quadratic self-attention cost presents an appealing direction. Recently, state space models (SSMs), such as Mamba, have demonstrated potential in language and vision tasks in this regard. In this study, we explore whether reliance on self-attention is necessary for audio classification tasks. By introducing Audio Mamba (AuM), the first self-attention-free, purely SSM-based model for audio classification, we aim to address this question. We evaluate AuM on six audio benchmarks, where it achieves performance comparable to or better than a well-established AST model.
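
To illustrate what a bidirectional state space pass over a token sequence looks like, here is a minimal sketch. It assumes a scalar hidden state with fixed parameters `a`, `b`, `c` and combines the two directions by summation. None of this is the authors' implementation: Mamba uses input-dependent parameters and a multi-dimensional state, and the way AuM merges directions is not described in the text above. All names are hypothetical.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """One left-to-right pass of h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0
    outputs = []
    for x_t in x:
        h = a * h + b * x_t      # state update: one step per token
        outputs.append(c * h)    # readout from the current state
    return outputs


def bidirectional_ssm(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Scan the sequence in both directions and sum the aligned outputs.

    Summation is an assumption made for this sketch.
    """
    forward = ssm_scan(x, a, b, c)
    # Scan the reversed sequence, then flip back so positions line up.
    backward = list(reversed(ssm_scan(list(reversed(x)), a, b, c)))
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    print(bidirectional_ssm([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
```

Each direction touches every token once, so the cost grows linearly with sequence length, and the backward pass lets each position also see tokens that come after it.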

Code & Models

Repositories

mhamzaerol/audio-mamba-aum
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies