Exploring the Capability of Mamba in Speech Applications
Koichi Miyazaki, Yoshiki Masuyama, Masato Murata

TL;DR
This paper evaluates Mamba, a state space model architecture, demonstrating its competitive performance and efficiency in various speech tasks compared to Transformer-based models.
Contribution
It provides the first comprehensive comparison of Mamba with Transformer variants across multiple speech applications, highlighting its effectiveness.
Findings
Mamba achieves comparable or superior performance to Transformers.
Mamba demonstrates efficiency in long-form speech processing.
Mamba performs well across diverse speech tasks.
Abstract
This paper explores the capability of Mamba, a recently proposed architecture based on state space models (SSMs), as a competitive alternative to Transformer-based models. In the speech domain, well-designed Transformer-based models, such as the Conformer and E-Branchformer, have become the de facto standards. Extensive evaluations have demonstrated the effectiveness of these Transformer-based models across a wide range of speech tasks. In contrast, the evaluation of SSMs has been limited to a few tasks, such as automatic speech recognition (ASR) and speech synthesis. In this paper, we compared Mamba with state-of-the-art Transformer variants for various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. Experimental evaluations revealed that Mamba achieves comparable or better performance than Transformer-based models, and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing
