MVANet: Multi-Stage Video Attention Network for Sound Event Localization and Detection with Source Distance Estimation
Hengyi Hong, Qing Wang, Jun Du, Ruoyu Wei, Mingqi Cai, Xin Fang

TL;DR
This paper introduces MVANet, a multi-stage video attention network for audio-visual sound event localization and detection with source distance estimation (3D SELD), achieving state-of-the-art results in the DCASE 2024 Challenge.
Contribution
The paper presents a novel multi-stage audio-visual network with a new output representation for combined DOA and distance estimation in 3D SELD.
Findings
Outperforms top methods in DCASE 2024 Challenge
Effective use of multi-stage audio features and data augmentation
Accurate source distance estimation in 3D sound localization
Abstract
Sound event localization and detection with source distance estimation (3D SELD) involves not only identifying the sound category and its direction-of-arrival (DOA) but also predicting the source's distance, aiming to provide full information about the sound position. This paper proposes a multi-stage video attention network (MVANet) for audio-visual (AV) 3D SELD. Multi-stage audio features are used to adaptively capture the spatial information of sound sources in videos. We propose a novel output representation that combines the DOA with distance of sound sources by calculating the real Cartesian coordinates to address the newly introduced source distance estimation (SDE) task in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge. We also employ a variety of effective data augmentation and pre-training methods. Experimental results on the STARSS23…
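To make the output representation described in the abstract concrete, the following is a minimal sketch of how a DOA (azimuth and elevation) and a source distance can be combined into real Cartesian coordinates, and recovered again. The function names, the use of degrees and meters, and the axis convention (x forward, y left, z up, azimuth measured counter-clockwise) are assumptions made for illustration; this excerpt does not spell out the paper's exact formulation.

```python
import math


def doa_distance_to_cartesian(azimuth_deg, elevation_deg, distance_m):
    """Map (azimuth, elevation, distance) to Cartesian (x, y, z).

    Assumed convention (not confirmed by the source): x forward, y left,
    z up; azimuth counter-clockwise from the x-axis; elevation measured
    from the horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return x, y, z


def cartesian_to_doa_distance(x, y, z):
    """Recover (azimuth_deg, elevation_deg, distance_m) from (x, y, z)."""
    distance = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))
    # atan2 against the horizontal radius stays well-defined near the poles.
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation, distance


# Round trip for a hypothetical source 2 m away, 30 degrees to the left
# and 10 degrees above the horizontal plane.
xyz = doa_distance_to_cartesian(30.0, 10.0, 2.0)
az, el, dist = cartesian_to_doa_distance(*xyz)
```

In this sketch the length of the coordinate vector carries the distance and its direction carries the DOA, so a single three-dimensional regression target per source covers both quantities.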
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: Softmax · Attention Is All You Need
