Enhancing Stereo Sound Event Detection with BiMamba and Pretrained PSELDnet

Wenmiao Gao; Han Yin

arXiv:2507.09570·eess.AS·July 15, 2025

Open Access

TL;DR

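This paper presents a stereo sound event localization and detection (SELD) system that pairs a pre-trained PSELDnet with a bidirectional Mamba (BiMamba) sequence model. On the DCASE2025 Task 3 development dataset, it outperforms both the baseline and the original Conformer-based PSELDnet while using fewer computing resources.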

Contribution

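The paper replaces the Conformer module of the pre-trained PSELDnet with a bidirectional Mamba (BiMamba) module and adds asymmetric convolutions to better capture time and frequency relationships in the audio signal, improving both accuracy and efficiency on stereo SELD.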
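To illustrate what an asymmetric convolution means on a time-frequency representation, here is a minimal pure-Python sketch. It is not the authors' implementation: the helper `conv2d_valid`, the 3×1 and 1×3 kernel shapes, and the toy 3×3 input are all assumptions chosen for illustration, since this listing does not state the kernel sizes used in the paper.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(spec: Matrix, kernel: Matrix) -> Matrix:
    """Valid 2-D cross-correlation over a (time x frequency) grid.

    Hypothetical helper for illustration only; rows are time frames,
    columns are frequency bins.
    """
    n_time, n_freq = len(spec), len(spec[0])
    k_time, k_freq = len(kernel), len(kernel[0])
    out: Matrix = []
    for t in range(n_time - k_time + 1):
        row = []
        for f in range(n_freq - k_freq + 1):
            acc = 0.0
            for i in range(k_time):
                for j in range(k_freq):
                    acc += spec[t + i][f + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Assumed kernel shapes: one kernel spans 3 time frames of a single bin,
# the other spans 3 frequency bins of a single frame.
time_kernel: Matrix = [[1.0], [1.0], [1.0]]  # shape 3 x 1
freq_kernel: Matrix = [[1.0, 1.0, 1.0]]      # shape 1 x 3

# Toy "spectrogram": 3 time frames x 3 frequency bins.
spec: Matrix = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

time_out = conv2d_valid(spec, time_kernel)  # sums along time only
freq_out = conv2d_valid(spec, freq_kernel)  # sums along frequency only
```

The two kernels have the same number of weights but aggregate along different axes, so one responds to temporal context and the other to spectral context. A square kernel would mix both at once.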

Findings

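01 Outperforms both the baseline and the original PSELDnet with a Conformer decoder on the DCASE2025 Task 3 development dataset

02 Requires fewer computing resources than the baselines

03 Asymmetric convolutions help capture time and frequency relationships in the audio signal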

Abstract

Pre-training methods have greatly improved the performance of sound event localization and detection (SELD). However, existing Transformer-based models still incur high computational cost. To address this problem, we present a stereo SELD system using a pre-trained PSELDnet and a bidirectional Mamba sequence model. Specifically, we replace the Conformer module with a BiMamba module. We also use asymmetric convolutions to better capture the time and frequency relationships in the audio signal. Test results on the DCASE2025 Task 3 development dataset show that our method performs better than both the baseline and the original PSELDnet with a Conformer decoder. In addition, the proposed model requires fewer computing resources than the baselines. These results show that the BiMamba architecture is effective for solving key challenges in SELD tasks. The source code is publicly accessible at…
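The idea of a bidirectional sequence model can be sketched with a much simpler stand-in than Mamba itself. The example below runs a fixed scalar linear state-space recurrence forward and backward over a sequence and sums the two outputs. It is an assumption-laden toy: real Mamba uses input-dependent (selective) parameters and a hardware-aware scan, and the function names, the parameters `a`, `b`, `c`, and the choice to merge directions by summation are hypothetical, not taken from the paper.

```python
from typing import List


def linear_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Causal scan of h_t = a * h_{t-1} + b * x_t, y_t = c * h_t, starting from h = 0."""
    h = 0.0
    out = []
    for x_t in x:
        h = a * h + b * x_t
        out.append(c * h)
    return out


def bidirectional_scan(
    x: List[float], a: float = 0.5, b: float = 1.0, c: float = 1.0
) -> List[float]:
    """Sum of a forward scan and a time-reversed scan (assumed merge rule)."""
    forward = linear_scan(x, a, b, c)
    # Reverse the input, scan, then reverse the output back to the original order.
    backward = linear_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# An impulse in the middle of the sequence influences both earlier and later steps.
impulse = [0.0, 1.0, 0.0]
causal_only = linear_scan(impulse, 0.5, 1.0, 1.0)
both_directions = bidirectional_scan(impulse)
```

In the causal scan the step before the impulse sees nothing, whereas the bidirectional output is symmetric around it. Each direction costs time linear in the sequence length, in contrast to the quadratic cost of self-attention over the same sequence.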

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis