Cross-attention Inspired Selective State Space Models for Target Sound Extraction

Donghang Wu; Yiwen Wang; Xihong Wu; Tianshu Qu

arXiv:2409.04803·eess.AS·June 26, 2025

TL;DR

This paper introduces CrossMamba, a novel model that combines the efficiency of state space models with the dependency capturing ability of cross-attention, improving target sound extraction performance.

Contribution

The paper proposes CrossMamba, which applies cross-attention principles to Mamba's hidden attention mechanism to improve target sound extraction while maintaining computational efficiency.

Findings

01. CrossMamba outperforms traditional methods in accuracy.

02. It reduces computational complexity compared to Transformer-based models.

03. Experimental validation confirms its effectiveness.

Abstract

The Transformer model, particularly its cross-attention module, is widely used for feature fusion in target sound extraction, which extracts the signal of interest based on given clues. Despite its effectiveness, this approach suffers from low computational efficiency. Recent advances in state space models, notably Mamba, have shown performance comparable to Transformer-based methods while significantly reducing computational complexity in various tasks. However, Mamba's applicability to target sound extraction is limited because it cannot capture dependencies between different sequences as cross-attention does. In this paper, we propose CrossMamba for target sound extraction, which leverages the hidden attention mechanism of Mamba to compute dependencies between the given clues and the audio mixture. The calculation of Mamba can be divided into the query, key…
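The "hidden attention" view mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses a simplified diagonal recurrence (no discretization step, a single channel), and the assignment of roles is an assumption, since the abstract is cut off before it names them. Here the readout vector `c` plays a query-like role and is computed from the clue, while the input vector `b` (key-like) and the input `x` (value-like) come from the mixture. All names, shapes, and projection weights are hypothetical, and the clue is assumed to be time-aligned with the mixture.

```python
import numpy as np

def selective_scan(x, a, b, c):
    """Recurrence h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t . h_t.

    x: (T,) input sequence for one channel.
    a, b, c: (T, N) per-step decay, input, and readout vectors (N = state size).
    """
    T, N = a.shape
    h = np.zeros(N)
    y = np.zeros(T)
    for t in range(T):
        h = a[t] * h + b[t] * x[t]   # fixed-size state, one update per step
        y[t] = c[t] @ h
    return y

def hidden_attention_matrix(a, b, c):
    """Unrolled form: M[i, j] = c_i . (a_i * ... * a_{j+1} * b_j) for j <= i."""
    T, N = a.shape
    M = np.zeros((T, T))
    for i in range(T):
        decay = np.ones(N)            # product of decays between step j and step i
        for j in range(i, -1, -1):
            M[i, j] = c[i] @ (decay * b[j])
            decay = decay * a[j]
    return M

rng = np.random.default_rng(0)
T, N, D = 6, 4, 3
mixture = rng.standard_normal((T, D))  # hypothetical mixture features
clue = rng.standard_normal((T, D))     # hypothetical clue features (assumed time-aligned)

# Hypothetical projection weights, random here purely for illustration.
W_a, W_b, W_c = (rng.standard_normal((D, N)) for _ in range(3))
w_x = rng.standard_normal(D)

a = 1.0 / (1.0 + np.exp(-(mixture @ W_a)))  # decay in (0, 1), from the mixture
b = mixture @ W_b                            # key-like term, from the mixture
c = clue @ W_c                               # query-like term, from the clue (assumption)
x = mixture @ w_x                            # value-like term, from the mixture

y_scan = selective_scan(x, a, b, c)          # linear-time recurrent form
M = hidden_attention_matrix(a, b, c)         # explicit T x T attention-like matrix
y_attn = M @ x
```

The two outputs agree, which is the sense in which the recurrence hides an attention-like matrix: the scan never materializes the T x T matrix, while the unrolled form makes the clue-to-mixture weights explicit. Unlike softmax cross-attention, this matrix is causal (lower triangular) and unnormalized.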

Code & Models

Repositories

WuDH2000/CrossMamba (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection