Mamba in Speech: Towards an Alternative to Self-Attention
Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, Julien Epps

TL;DR
This paper investigates the application of Mamba, an alternative to self-attention, in speech processing tasks, demonstrating that bidirectional Mamba improves performance in speech recognition and enhancement over vanilla Mamba.
Contribution
It introduces the use of bidirectional Mamba in speech tasks and shows its advantages as an alternative to self-attention in Transformer models.
Findings
BiMamba outperforms vanilla Mamba in speech tasks.
Bidirectional design enhances speech processing performance.
BiMamba is effective as a self-attention substitute in Transformers.
Abstract
Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism in Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing by discussing two typical speech processing tasks: speech recognition, which requires semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The experimental results confirm that bidirectional Mamba (BiMamba) consistently outperforms vanilla Mamba, highlighting the advantages…
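Illustrative sketch
The difference between a causal (vanilla) scan and a bidirectional one can be illustrated with a toy example. The sketch below is not the paper's BiMamba block. It assumes a scalar linear state-space recurrence with fixed parameters a, b, and c, whereas Mamba makes these parameters input-dependent, and it assumes the forward and backward outputs are combined by summation. The function names causal_scan and bidirectional_scan are hypothetical.

```python
from typing import List


def causal_scan(x: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scalar linear state-space recurrence (a simplification, not selective):
    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    Each output depends only on the current and past inputs."""
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def bidirectional_scan(x: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Run the same recurrence left-to-right and right-to-left, then sum.
    Summation is an assumed fusion rule chosen for illustration."""
    forward = causal_scan(x, a, b, c)
    # Scan the reversed sequence, then flip the result back to the original order.
    backward = causal_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    # An impulse at the last frame: the causal scan cannot "see" it at earlier frames.
    frames = [0.0, 0.0, 1.0]
    print(causal_scan(frames, a=0.5))
    print(bidirectional_scan(frames, a=0.5))
```

In this toy setting, the causal output at the first frame carries no information about the final frame, while the bidirectional output does. Both scans run in time linear in the sequence length, in contrast to the quadratic pairwise comparisons of self-attention.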
Taxonomy
Topics: Education and Technology Integration
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Dropout
