Exploring the Capability of Mamba in Speech Applications

Koichi Miyazaki; Yoshiki Masuyama; Masato Murata

arXiv:2406.16808 · cs.SD · June 25, 2024

TL;DR

This paper evaluates Mamba, a state space model architecture, and finds it competitive with Transformer-based models in both performance and efficiency across a range of speech tasks.

Contribution

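The paper provides the first comprehensive comparison of Mamba with Transformer variants across multiple speech applications (ASR, text-to-speech, spoken language understanding, and speech summarization) and shows that Mamba is an effective alternative.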

Findings

1. Mamba achieves performance comparable to or better than that of Transformer-based models.
2. Mamba is efficient in long-form speech processing.
3. Mamba performs well across diverse speech tasks.

Abstract

This paper explores the capability of Mamba, a recently proposed architecture based on state space models (SSMs), as a competitive alternative to Transformer-based models. In the speech domain, well-designed Transformer-based models, such as the Conformer and E-Branchformer, have become the de facto standards. Extensive evaluations have demonstrated the effectiveness of these Transformer-based models across a wide range of speech tasks. In contrast, the evaluation of SSMs has been limited to a few tasks, such as automatic speech recognition (ASR) and speech synthesis. In this paper, we compared Mamba with state-of-the-art Transformer variants for various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. Experimental evaluations revealed that Mamba achieves comparable or better performance than Transformer-based models, and…

Taxonomy

Topics: Speech and Audio Processing