Cross-attention Inspired Selective State Space Models for Target Sound Extraction
Donghang Wu, Yiwen Wang, Xihong Wu, Tianshu Qu

TL;DR
This paper introduces CrossMamba, a novel model that combines the efficiency of state space models with the dependency capturing ability of cross-attention, improving target sound extraction performance.
Contribution
It proposes CrossMamba, integrating Mamba's hidden attention with cross-attention principles to enhance target sound extraction while maintaining computational efficiency.
Findings
CrossMamba outperforms traditional methods in accuracy.
It reduces computational complexity compared to Transformer-based models.
Experimental validation confirms its effectiveness.
Abstract
The Transformer model, particularly its cross-attention module, is widely used for feature fusion in target sound extraction, which extracts the signal of interest based on given clues. Despite its effectiveness, this approach suffers from low computational efficiency. Recent advances in state space models, notably Mamba, have shown performance comparable to Transformer-based methods while significantly reducing computational complexity in various tasks. However, Mamba's applicability to target sound extraction is limited because it cannot capture dependencies between different sequences as cross-attention does. In this paper, we propose CrossMamba for target sound extraction, which leverages the hidden attention mechanism of Mamba to compute dependencies between the given clues and the audio mixture. The calculation of Mamba can be divided into the query, key…
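The "hidden attention" reading of a selective state space model can be illustrated with a small sketch. It is not the authors' implementation. It assumes a single-channel recurrence h_t = a_t * h_{t-1} + b_t * x_t with output y_t = <c_t, h_t>, and it assumes, purely for illustration, that the query-like term c_t is derived from the clue while the key-like term b_t, the decay a_t, and the value x_t are derived from the mixture. The projection weights (w_a, w_b, w_c) and the clue being already aligned to the mixture length are hypothetical choices. The sketch checks that the recurrent form and the unrolled attention-like form give the same output.

```python
import math
import random


def cross_ssm_recurrent(x, a, b, c):
    """Recurrent form: h_t = a_t * h_{t-1} + b_t * x_t, y_t = <c_t, h_t>."""
    n = len(b[0])
    h = [0.0] * n
    y = []
    for t in range(len(x)):
        h = [a[t][i] * h[i] + b[t][i] * x[t] for i in range(n)]
        y.append(sum(c[t][i] * h[i] for i in range(n)))
    return y


def hidden_attention(a, b, c):
    """Unrolled weights: alpha[t][s] = sum_i c_t[i] * prod_{k=s+1..t} a_k[i] * b_s[i]."""
    T, n = len(b), len(b[0])
    alpha = [[0.0] * T for _ in range(T)]
    for t in range(T):
        decay = [1.0] * n  # empty product when s == t
        for s in range(t, -1, -1):
            alpha[t][s] = sum(c[t][i] * decay[i] * b[s][i] for i in range(n))
            decay = [decay[i] * a[s][i] for i in range(n)]
    return alpha


random.seed(0)
T, n = 6, 4  # sequence length and state size (illustrative values)

# Toy inputs; the clue is assumed to be aligned to the mixture length.
mixture = [random.uniform(-1, 1) for _ in range(T)]
clue = [random.uniform(-1, 1) for _ in range(T)]

# Hypothetical projection weights, not taken from the paper.
w_a = [random.uniform(-1, 1) for _ in range(n)]
w_b = [random.uniform(-1, 1) for _ in range(n)]
w_c = [random.uniform(-1, 1) for _ in range(n)]

a = [[math.exp(-abs(w * m)) for w in w_a] for m in mixture]  # decay in (0, 1], from mixture
b = [[w * m for w in w_b] for m in mixture]                  # key-like term, from mixture
c = [[w * q for w in w_c] for q in clue]                     # query-like term, from clue (assumption)

y_recurrent = cross_ssm_recurrent(mixture, a, b, c)
alpha = hidden_attention(a, b, c)
y_attention = [sum(alpha[t][s] * mixture[s] for s in range(T)) for t in range(T)]

max_gap = max(abs(u - v) for u, v in zip(y_recurrent, y_attention))
print("max difference between the two forms:", max_gap)
```

The recurrent form costs O(T) steps, whereas materializing alpha costs O(T^2). The attention-like matrix is built here only to make the clue-to-mixture dependency visible, not because an efficient implementation would compute it.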
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection
