Enhancing Stereo Sound Event Detection with BiMamba and Pretrained PSELDnet
Wenmiao Gao, Han Yin

TL;DR
This paper introduces a stereo sound event localization and detection (SELD) system that combines a pre-trained PSELDnet with a bidirectional Mamba (BiMamba) sequence model, outperforming the baseline at lower computational cost on the DCASE2025 Task 3 development dataset.
Contribution
The system replaces the Conformer module of PSELDnet with a BiMamba module and uses asymmetric convolutions to capture time and frequency relationships, improving accuracy and efficiency in stereo SELD tasks.
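As a rough illustration of what asymmetric convolutions can look like, the sketch below applies a 1×3 kernel along the time axis and a 3×1 kernel along the frequency axis of a toy spectrogram. The kernel shapes, the kernel values, the `conv2d_valid` helper, and the `[frequency][time]` layout are all assumptions made for this example; the summary does not state the configuration the authors use.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Plain 2D cross-correlation with no padding ("valid" mode).

    Hypothetical helper for illustration only, not the authors' code.
    """
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(iw - kw + 1)
        ]
        for i in range(ih - kh + 1)
    ]


# Toy spectrogram laid out as [frequency][time] (assumed layout).
spec = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [3.0, 6.0, 9.0, 12.0],
]

# 1x3 kernel: spans three time frames within a single frequency bin.
time_kernel = [[1.0, 0.0, -1.0]]
# 3x1 kernel: spans three frequency bins within a single time frame.
freq_kernel = [[1.0], [0.0], [-1.0]]

time_features = conv2d_valid(spec, time_kernel)  # shape 3 x 2
freq_features = conv2d_valid(spec, freq_kernel)  # shape 1 x 4
```

Each non-square kernel responds to change along one axis only, which is one plausible reading of how such kernels separate temporal from spectral structure.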
Findings
Outperforms both the baseline and the original PSELDnet with a Conformer decoder
Requires fewer computing resources than the baselines
Effective in capturing time and frequency relationships
Abstract
Pre-training methods have greatly improved the performance of sound event localization and detection (SELD). However, existing Transformer-based models still incur a high computational cost. To address this problem, we present a stereo SELD system that uses a pre-trained PSELDnet and a bidirectional Mamba (BiMamba) sequence model. Specifically, we replace the Conformer module with a BiMamba module. We also use asymmetric convolutions to better capture the time and frequency relationships in the audio signal. Experimental results on the DCASE2025 Task 3 development dataset show that our method performs better than both the baseline and the original PSELDnet with a Conformer decoder. In addition, the proposed model requires fewer computing resources than the baselines. These results show that the BiMamba architecture is effective at addressing key challenges in SELD tasks. The source code is publicly accessible at…
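To make the idea of a bidirectional sequence model concrete, here is a minimal sketch that runs a simple linear recurrence over a sequence in both directions and sums the two passes. It is a deliberate simplification and not the authors' BiMamba: real Mamba layers use input-dependent (selective) state-space parameters, whereas the fixed scalar coefficients `a` and `b`, the function names, and the choice to combine the passes by summation are all assumptions made for this example.

```python
from typing import List


def linear_scan(x: List[float], a: float, b: float) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t over the sequence, starting from h = 0."""
    h = 0.0
    out = []
    for x_t in x:
        h = a * h + b * x_t
        out.append(h)
    return out


def bidirectional_scan(x: List[float], a: float = 0.5, b: float = 1.0) -> List[float]:
    """Combine a forward pass and a backward pass (assumed: element-wise sum)."""
    forward = linear_scan(x, a, b)
    # Scan the reversed sequence, then flip the result back into original order.
    backward = linear_scan(x[::-1], a, b)[::-1]
    return [f + bk for f, bk in zip(forward, backward)]


# An impulse in the middle of the sequence influences both neighbours,
# whereas a forward-only scan would only affect later time steps.
forward_only = linear_scan([0.0, 1.0, 0.0], 0.5, 1.0)
both_directions = bidirectional_scan([0.0, 1.0, 0.0])
```

Each pass costs time linear in the sequence length, in contrast to the quadratic cost of full self-attention.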
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
