SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
Siavash Shams, Sukru Samet Dindar, Xilin Jiang, Nima Mesgarani

TL;DR
SSAMBA is a self-supervised, attention-free audio representation model built on Mamba state space models. It outperforms the transformer-based SSAST on multiple audio tasks while running faster and using less memory at inference.
Contribution
The paper presents what the authors describe as the first self-supervised, attention-free, SSM-based model for audio representation learning, and reports that it is significantly more efficient than its transformer counterpart, SSAST.
Findings
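Outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) on multiple downstream audio tasks.
Runs batch inference approximately 92.7% faster than SSAST at the tiny model size with an input token size of 22k.
Uses approximately 95.4% less memory than SSAST in the same tiny-model setting.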
Abstract
Transformers have revolutionized deep learning across various tasks, including audio representation learning, due to their powerful modeling capabilities. However, they often suffer from quadratic complexity in both GPU memory usage and computational inference time, affecting their efficiency. Recently, state space models (SSMs) like Mamba have emerged as a promising alternative, offering a more efficient approach by avoiding these complexities. Given these advantages, we explore the potential of SSM-based models in audio tasks. In this paper, we introduce Self-Supervised Audio Mamba (SSAMBA), the first self-supervised, attention-free, and SSM-based model for audio representation learning. SSAMBA leverages the bidirectional Mamba to capture complex audio patterns effectively. We incorporate a self-supervised pretraining framework that optimizes both discriminative and generative…
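The efficiency argument in the abstract can be made concrete with a toy sketch. The code below is not Mamba and not the SSAMBA implementation: it is a scalar, time-invariant linear state space recurrence (Mamba's parameters are input-dependent and its scan is hardware-optimized), and combining the two directions by summation is an assumption made only for illustration. The function names and parameters a, b, c are hypothetical. It shows why a recurrent scan costs one pass over the sequence with a fixed-size state, whereas full self-attention materializes a score for every pair of tokens.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar, time-invariant linear SSM over a sequence.

    h[t] = a * h[t-1] + b * x[t]
    y[t] = c * h[t]

    One pass with a constant-size state: O(L) time, O(1) extra memory.
    """
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t  # update the hidden state
        y.append(c * h)      # read out the current state
    return y


def bidirectional_ssm(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Sum a forward scan and a backward scan (an illustrative assumption),
    so every output position depends on inputs on both sides."""
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


def attention_score_count(seq_len: int) -> int:
    """Pairwise scores that full self-attention computes per head: L * L."""
    return seq_len * seq_len


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0]
    print(ssm_scan(impulse, a=0.5, b=1.0, c=1.0))
    print(bidirectional_ssm(impulse, a=0.5, b=1.0, c=1.0))
    print(attention_score_count(22_000))
```

With a sequence of length L, the scan performs L state updates in each direction, while the attention score count grows as L squared. That growth is the quadratic cost the abstract refers to.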
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout
