Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition
Yoshiki Masuyama, Koichi Miyazaki, Masato Murata

TL;DR
This paper introduces MADEON, a decoder-only speech recognition model using Mamba SSMs with bidirectional speech modeling, achieving competitive performance with Transformer-based models on large datasets.
Contribution
It presents a novel decoder-only ASR architecture with bidirectional speech modeling using Mamba SSMs, enhancing efficiency and performance.
Findings
MADEON outperforms non-selective SSM models.
Speech prefixing improves contextual understanding.
Achieves comparable results to Transformer models.
Abstract
Selective state space models (SSMs) represented by Mamba have demonstrated their computational efficiency and promising outcomes in various tasks, including automatic speech recognition (ASR). Mamba has been applied to ASR task with the attention-based encoder-decoder framework, where the cross-attention mechanism between encoder and decoder remains. This paper explores the capability of Mamba as the decoder-only architecture in ASR task. Our MAmba-based DEcoder-ONly approach (MADEON) consists of a single decoder that takes speech tokens as a condition and predicts text tokens in an autoregressive manner. To enhance MADEON, we further propose speech prefixing that performs bidirectional processing on speech tokens, which enriches the contextual information in the hidden states. Our experiments show that MADEON significantly outperforms a non-selective SSM. The combination of speech…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
