Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models

Xiaoxue Gao; Nancy F. Chen

arXiv:2409.18654·eess.AS·September 30, 2024

TL;DR

Speech-Mamba introduces a novel long-context speech recognition model that combines selective state space models with Transformers, enabling efficient long-range dependency modeling with near-linear scaling.

Contribution

It pioneers the integration of selective state space models into speech recognition, enhancing long-sequence modeling capabilities beyond previous Transformer limitations.

Findings

01 Outperforms traditional models on long speech sequences

02 Scales near-linearly with sequence length

03 Improves long-range dependency modeling in speech recognition

Abstract

Current automatic speech recognition systems struggle to model long speech sequences because of the quadratic complexity of Transformer-based models. Selective state space models such as Mamba have performed well on long-sequence modeling in natural language processing and computer vision tasks. However, their use in speech technology tasks remains under-explored. We propose Speech-Mamba, which incorporates selective state space modeling into Transformer neural architectures. In Speech-Mamba, long-sequence representations from selective state space models are complemented by lower-level representations from Transformer-based modeling. Speech-Mamba achieves a better capacity to model long-range dependencies, as it scales near-linearly with sequence length.
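To make the scaling claim concrete, here is a minimal sketch of a single-channel selective state space scan in plain Python. It is not the authors' implementation: the function names, the scalar weights `w_delta`, `w_b`, `w_c`, and the simplified discretization are assumptions chosen for illustration. What it shows is that the step size and the input/output projections depend on the current input (the "selective" part), and that the recurrence visits each time step once, so work grows linearly with sequence length, whereas self-attention compares every pair of positions.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size strictly positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(x, a, w_delta, w_b, w_c):
    """Run a toy selective SSM over a 1-D sequence x (list of floats).

    a        : diagonal state matrix entries (negative floats), length N
    w_delta  : hypothetical weight producing the input-dependent step size
    w_b, w_c : hypothetical weights producing input-dependent B_t and C_t
    """
    n = len(a)
    h = [0.0] * n  # hidden state, carried across time steps
    ys = []
    for x_t in x:
        delta = softplus(w_delta * x_t)      # input-dependent step size
        b_t = [w * x_t for w in w_b]         # input-dependent input projection
        c_t = [w * x_t for w in w_c]         # input-dependent output projection
        for i in range(n):
            # h_t = exp(delta * A) * h_{t-1} + delta * B_t * x_t
            h[i] = math.exp(delta * a[i]) * h[i] + delta * b_t[i] * x_t
        ys.append(sum(c_t[i] * h[i] for i in range(n)))  # y_t = C_t . h_t
    return ys


def attention_pair_count(seq_len):
    # Self-attention scores every (query, key) pair: quadratic in length.
    return seq_len * seq_len


def scan_update_count(seq_len, state_size):
    # The scan updates each state entry once per step: linear in length.
    return seq_len * state_size
```

Doubling the sequence length doubles `scan_update_count` but quadruples `attention_pair_count`, which is the gap the abstract refers to. Practical implementations use a hardware-aware parallel scan over many channels rather than this sequential loop.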

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings