Selective State Space Model for Monaural Speech Enhancement

Moran Chen; Qiquan Zhang; Mingjiang Wang; Xiangyu Zhang; Hexin Liu; Eliathamby Ambikairajah; and Deying Chen

arXiv:2411.06217 · eess.AS · November 12, 2024 · IEEE Trans. Consumer Electron.

TL;DR

This paper introduces MambaDC, a hybrid convolution-Mamba model that effectively handles long sequences with linear complexity, improving speech enhancement performance over existing Transformer-based models.

Contribution

The paper proposes MambaDC, a novel hybrid backbone combining convolutional networks and the Mamba state space model for improved long-range dependency modeling in speech enhancement.

Findings

01. MambaDC outperforms Transformer, Conformer, and standard Mamba in experiments.

02. MambaDC achieves superior results across all training targets.

03. The model demonstrates efficient long-range global dependency modeling.

Abstract

Voice user interfaces (VUIs) have facilitated efficient interaction between humans and machines through spoken commands. Since real-world acoustic scenes are complex, speech enhancement plays a critical role in robust VUIs. Transformer and its variants, such as Conformer, have demonstrated cutting-edge results in speech enhancement. However, both suffer from quadratic computational complexity with respect to the sequence length, which hampers their ability to handle long sequences. Recently, a novel state space model called Mamba, which shows a strong capability to handle long sequences with linear complexity, has offered a solution to this challenge. In this paper, we propose a novel hybrid convolution-Mamba backbone, denoted as MambaDC, for speech enhancement. Our MambaDC marries the benefits of convolutional networks to model the local interactions and Mamba's…
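To make the linear-complexity claim concrete, the following is a minimal NumPy sketch of a selective (input-dependent) state space recurrence in the general style of Mamba. It is not the MambaDC architecture and not the authors' code: the function name `selective_scan`, the projection weights `W_delta`, `W_B`, `W_C`, the diagonal state matrix `A`, and the simplified discretization are all assumptions made for illustration. The point it shows is that the hidden state has a fixed size and the sequence is processed in a single pass, so the cost grows linearly with the number of frames rather than quadratically as in self-attention.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus; keeps the step size strictly positive.
    return np.logaddexp(0.0, x)

def selective_scan(x, A, W_delta, W_B, W_C):
    """Diagonal selective SSM over a sequence, one pass, O(T) in length.

    x:       (T, D) input frames
    A:       (D, N) negative decay rates (assumed diagonal state matrix)
    W_delta: (D, D), W_B: (D, N), W_C: (D, N) hypothetical projections
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))   # hidden state; its size does not depend on T
    y = np.empty((T, D))
    for t in range(T):     # single sweep over the sequence
        delta = softplus(x[t] @ W_delta)    # (D,) input-dependent step size
        B = x[t] @ W_B                      # (N,) input-dependent input map
        C = x[t] @ W_C                      # (N,) input-dependent readout
        A_bar = np.exp(delta[:, None] * A)  # (D, N) decay in (0, 1) for A < 0
        # Simplified discretization: B_bar is approximated by delta * B.
        h = A_bar * h + (delta[:, None] * B[None, :]) * x[t][:, None]
        y[t] = h @ C                        # (D,) output frame
    return y

# Illustrative usage with random weights (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, D, N = 200, 4, 8
x = rng.standard_normal((T, D))
A = -np.exp(rng.standard_normal((D, N)))    # strictly negative rates
W_delta = 0.1 * rng.standard_normal((D, D))
W_B = 0.1 * rng.standard_normal((D, N))
W_C = 0.1 * rng.standard_normal((D, N))
y = selective_scan(x, A, W_delta, W_B, W_C)
print(y.shape)
```

Because `delta`, `B`, and `C` are computed from the current frame, the recurrence can decide per frame how much past context to keep, which is the "selective" aspect. Practical implementations typically replace the Python loop with a hardware-aware parallel scan; the loop is used here only for readability.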

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Absolute Position Encodings · Label Smoothing · Layer Normalization · Adam · Multi-Head Attention · Position-Wise Feed-Forward Layer · Residual Connection