SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model

Siavash Shams; Sukru Samet Dindar; Xilin Jiang; Nima Mesgarani

arXiv:2405.11831 · eess.AS · February 6, 2025 · 1 citation

TL;DR

SSAMBA introduces a novel self-supervised, attention-free audio representation model based on Mamba state space models, achieving superior performance and efficiency over transformer-based models in various audio tasks.

Contribution

SSAMBA is the first self-supervised, SSM-based audio model. It is attention-free and significantly more efficient than its transformer counterparts.

Findings

01. Outperforms SSAST in multiple audio tasks.

02. Runs inference 92.7% faster than SSAST.

03. Uses 95.4% less memory than SSAST at the tiny model size.

Abstract

Transformers have revolutionized deep learning across various tasks, including audio representation learning, due to their powerful modeling capabilities. However, they often suffer from quadratic complexity in both GPU memory usage and computational inference time, affecting their efficiency. Recently, state space models (SSMs) like Mamba have emerged as a promising alternative, offering a more efficient approach by avoiding these complexities. Given these advantages, we explore the potential of SSM-based models in audio tasks. In this paper, we introduce Self-Supervised Audio Mamba (SSAMBA), the first self-supervised, attention-free, and SSM-based model for audio representation learning. SSAMBA leverages the bidirectional Mamba to capture complex audio patterns effectively. We incorporate a self-supervised pretraining framework that optimizes both discriminative and generative…
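To make the efficiency argument in the abstract concrete, here is a minimal sketch of a bidirectional linear state space scan. It is an illustration under stated assumptions, not the SSAMBA implementation: the names `ssm_scan`, `bidirectional_ssm`, and `attention_score_count` are hypothetical, the coefficients `a`, `b`, `c` are fixed scalars (Mamba makes its parameters input-dependent), and summing the two directions is an assumed way of combining them. The point is only that each scan touches every token once, while self-attention computes a score for every pair of tokens.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Scalar linear state space recurrence (simplified, non-selective).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    A single pass over the sequence, so cost grows linearly with len(x).
    """
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def bidirectional_ssm(
    x: List[float], a: float = 0.5, b: float = 1.0, c: float = 1.0
) -> List[float]:
    """Scan the sequence in both directions and sum the outputs.

    Summation is an assumption made for this sketch; the backward pass
    runs the same recurrence on the reversed sequence and is flipped back
    so that position t lines up with the forward output at t.
    """
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + bw for f, bw in zip(forward, backward)]


def attention_score_count(seq_len: int) -> int:
    """Number of pairwise scores full self-attention computes for one head."""
    return seq_len * seq_len


if __name__ == "__main__":
    tokens = [1.0, 0.0, 0.0]
    print(bidirectional_ssm(tokens))
    # Two scans visit 2 * L tokens; attention scores L * L pairs.
    for length in (100, 1000, 10000):
        print(length, 2 * length, attention_score_count(length))
```

With these fixed coefficients, an impulse at the first position decays geometrically in the forward scan and appears only at its own position in the backward scan, which is why both directions are needed for a token to be influenced by context on either side.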

Code & Models

Repositories

siavashshams/ssamba (PyTorch, official)

Models

attentionisallyouneed369/ssamba (Hugging Face model, ♡ 4)

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout