Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition

Yoshiki Masuyama; Koichi Miyazaki; Masato Murata

arXiv:2411.06968 · cs.SD · November 12, 2024

TL;DR

This paper introduces MADEON, a decoder-only speech recognition model built on Mamba SSMs with bidirectional speech modeling, which performs comparably to Transformer-based models on large datasets.

Contribution

The paper presents a decoder-only ASR architecture built on Mamba SSMs, in which the speech tokens are processed bidirectionally, with the aim of improving both efficiency and recognition performance.

Findings

01 MADEON outperforms a non-selective SSM.

02 Speech prefixing enriches the contextual information in the hidden states.

03 MADEON achieves results comparable to Transformer models.

Abstract

Selective state space models (SSMs), represented by Mamba, have demonstrated computational efficiency and promising results in various tasks, including automatic speech recognition (ASR). Mamba has been applied to ASR within the attention-based encoder-decoder framework, where the cross-attention mechanism between encoder and decoder remains. This paper explores the capability of Mamba as a decoder-only architecture for ASR. Our MAmba-based DEcoder-ONly approach (MADEON) consists of a single decoder that takes speech tokens as a condition and predicts text tokens in an autoregressive manner. To enhance MADEON, we further propose speech prefixing, which performs bidirectional processing on the speech tokens and thereby enriches the contextual information in the hidden states. Our experiments show that MADEON significantly outperforms a non-selective SSM. The combination of speech…
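The idea of speech prefixing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a scalar linear recurrence stands in for a Mamba block (it is not selective), and the function names, the summation of forward and backward states, and the choice to seed the text scan with the last prefix state are all assumptions made purely for illustration. The point it shows is that the speech prefix sees context in both directions, while text tokens are still processed strictly left to right.

```python
from typing import List, Tuple


def causal_scan(xs: List[float], decay: float = 0.5, h0: float = 0.0) -> List[float]:
    """Left-to-right linear recurrence h_t = decay * h_{t-1} + x_t.

    A toy, non-selective stand-in for an SSM layer (assumption).
    """
    states, h = [], h0
    for x in xs:
        h = decay * h + x
        states.append(h)
    return states


def bidirectional_prefix(speech: List[float], decay: float = 0.5) -> List[float]:
    """Sum a forward and a backward scan over the speech tokens only.

    Summation is a hypothetical way to combine the two directions.
    """
    fwd = causal_scan(speech, decay)
    bwd = causal_scan(speech[::-1], decay)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def decode_states(
    speech: List[float], text: List[float], decay: float = 0.5
) -> Tuple[List[float], List[float]]:
    """Process the speech prefix bidirectionally, then the text causally.

    The text scan is seeded with the last prefix state (assumption).
    """
    prefix = bidirectional_prefix(speech, decay)
    text_states = causal_scan(text, decay, h0=prefix[-1])
    return prefix, text_states


speech = [1.0, 0.0, 0.0, 2.0]  # toy "speech tokens"
text = [0.5, 0.5]              # toy "text tokens"
prefix, text_states = decode_states(speech, text)
```

In this sketch the first prefix state already reflects the last speech frame, which a purely causal scan could not do, whereas each text state depends only on the prefix and on earlier text tokens, so autoregressive prediction is preserved.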

Code & Models

Repositories

YoshikiMas/madeon-asr (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques