Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations

Sarthak Yadav; Zheng-Hua Tan

arXiv:2406.02178 · cs.SD · June 11, 2024

TL;DR

Audio Mamba introduces a novel selective state space model for self-supervised learning of general-purpose audio representations, outperforming existing spectrogram transformer baselines across multiple audio recognition tasks.

Contribution

The paper is the first to adapt selective state space models to self-supervised audio representation learning, and it reports better performance than comparable transformer-based baselines.

Findings

01 Outperforms SSAST baselines on ten audio recognition tasks

02 Shows robustness across different dataset sizes, sequence lengths, and model sizes

03 Models pretrained on AudioSet beat comparable SSAST baselines by a considerable margin

Abstract

Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations from randomly masked spectrogram patches through self-supervision. Empirical results on ten diverse audio recognition downstream tasks show that the proposed models, pretrained on the AudioSet dataset, consistently outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by a considerable margin and demonstrate better performance in dataset size, sequence…
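The pretraining input described in the abstract, randomly masked spectrogram patches, can be illustrated with a minimal NumPy sketch. It covers only the patch splitting and mask selection; the encoder and the training objective are omitted. The spectrogram size, patch shape, mask ratio, and the function names `patchify` and `random_mask` are all hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def patchify(spec, patch_f, patch_t):
    """Split a (freq, time) spectrogram into flattened non-overlapping patches."""
    n_f, n_t = spec.shape[0] // patch_f, spec.shape[1] // patch_t
    spec = spec[: n_f * patch_f, : n_t * patch_t]  # drop any ragged edge
    # (n_f, patch_f, n_t, patch_t) -> (n_f, n_t, patch_f, patch_t)
    patches = spec.reshape(n_f, patch_f, n_t, patch_t).transpose(0, 2, 1, 3)
    return patches.reshape(n_f * n_t, patch_f * patch_t)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask; True marks a patch that is hidden."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


rng = np.random.default_rng(0)
spec = rng.standard_normal((80, 200))            # assumed: 80 mel bins x 200 frames
patches = patchify(spec, patch_f=16, patch_t=4)  # assumed patch shape
mask = random_mask(len(patches), mask_ratio=0.5, rng=rng)  # assumed mask ratio

visible = patches[~mask]  # patches left unmasked
targets = patches[mask]   # patches selected for masking
```

Here `patches` is a sequence of 250 flattened patches in row-major (frequency, then time) order, which is the kind of sequence a state space model or transformer encoder would consume.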

Code & Models

Repositories

SarthakYadav/audio-mamba-official
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention