Rethinking Mamba in Speech Processing by Self-Supervised Models

Xiangyu Zhang; Jianbo Ma; Mostafa Shahin; Beena Ahmed; Julien Epps

arXiv:2409.07273 · eess.AS · September 12, 2024 · ICASSP

Open Access

TL;DR

This paper investigates Mamba-based models in speech processing. It finds that they excel at reconstruction tasks but need additional modules for classification tasks such as speech recognition, a conclusion supported by an information-theoretic analysis.

Contribution

The study offers a new account of the strengths and limitations of Mamba models in speech tasks: it proposes a hypothesis and validates it through information-theoretic analysis and integration with HuBERT.

Findings

01. Mamba models perform well in speech reconstruction tasks.

02. Additional modules are needed for speech recognition tasks.

03. Mutual information analysis supports the hypothesis.

Abstract

The Mamba-based model has demonstrated outstanding performance across tasks in computer vision, natural language processing, and speech processing. In speech processing, however, its performance varies across tasks. In tasks such as speech enhancement and spectrum reconstruction, the Mamba model performs well when used independently, whereas for tasks like speech recognition, additional modules are required to surpass the performance of attention-based models. We hypothesize that the Mamba-based model excels in "reconstruction" tasks within speech processing, while "classification" tasks such as speech recognition need additional modules to accomplish the "reconstruction" step. To validate our hypothesis, we analyze previous Mamba-based speech models from an information theory perspective.…
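As a rough illustration of the kind of quantity an information-theoretic analysis works with, the sketch below computes a plug-in estimate of mutual information I(X; Y) from paired discrete samples. It is not the estimator used in the paper, which this page does not describe. The function name `mutual_information` and the toy "input" and "output" sequences are assumptions made for illustration, and real speech features are continuous and high-dimensional, so they would need a different estimator.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Toy data (assumed): discretized "input" symbols and two hypothetical outputs.
inputs = [0, 0, 1, 1, 2, 2, 3, 3]
copy_like = list(inputs)              # output preserves the input exactly
collapsed = [v // 2 for v in inputs]  # output merges pairs of input symbols

mi_copy = mutual_information(inputs, copy_like)
mi_collapsed = mutual_information(inputs, collapsed)
print(f"copy-like: {mi_copy:.2f} bits, collapsed: {mi_collapsed:.2f} bits")
```

In this toy setting the copy-like output retains all 2 bits of the uniform four-symbol input, while the collapsed output retains 1 bit. That difference loosely mirrors the "reconstruction" versus "classification" distinction drawn in the abstract, without making any claim about the paper's actual measurements.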

Taxonomy

Topics: Speech and Audio Processing

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces