MVANet: Multi-Stage Video Attention Network for Sound Event Localization and Detection with Source Distance Estimation

Hengyi Hong; Qing Wang; Jun Du; Ruoyu Wei; Mingqi Cai; Xin Fang

arXiv:2411.14153 · eess.AS · November 22, 2024 · ICASSP

TL;DR

This paper introduces MVANet, a multi-stage video attention network for 3D sound event localization and detection, including source distance estimation, achieving state-of-the-art results in the DCASE 2024 Challenge.

Contribution

The paper presents a novel multi-stage audio-visual network with a new output representation for combined DOA and distance estimation in 3D SELD.

Findings

1. Outperforms top methods in DCASE 2024 Challenge
2. Effective use of multi-stage audio features and data augmentation
3. Accurate source distance estimation in 3D sound localization

Abstract

Sound event localization and detection with source distance estimation (3D SELD) involves not only identifying the sound category and its direction-of-arrival (DOA) but also predicting the source's distance, aiming to provide full information about the sound position. This paper proposes a multi-stage video attention network (MVANet) for audio-visual (AV) 3D SELD. Multi-stage audio features are used to adaptively capture the spatial information of sound sources in videos. We propose a novel output representation that combines the DOA with distance of sound sources by calculating the real Cartesian coordinates to address the newly introduced source distance estimation (SDE) task in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge. We also employ a variety of effective data augmentation and pre-training methods. Experimental results on the STARSS23…
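
The abstract describes an output representation that merges DOA and distance into real Cartesian coordinates. A minimal Python sketch of that idea follows. It is not the authors' implementation: the function names are hypothetical, and the angle convention (azimuth measured in the horizontal plane from the x-axis, elevation measured upward from that plane, both in degrees) is an assumption, since the text shown here does not state one.

```python
import math


def doa_distance_to_cartesian(azimuth_deg, elevation_deg, distance_m):
    """Map a DOA (azimuth, elevation) plus a distance to (x, y, z).

    Assumed convention (not given in the source): azimuth is measured in the
    horizontal plane from the x-axis, elevation upward from that plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    # A unit DOA vector scaled by the distance gives the source position.
    x = distance_m * math.cos(el) * math.cos(az)
    y = distance_m * math.cos(el) * math.sin(az)
    z = distance_m * math.sin(el)
    return x, y, z


def cartesian_to_doa_distance(x, y, z):
    """Recover (azimuth_deg, elevation_deg, distance_m) from a position."""
    distance = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))
    # atan2 against the horizontal radius avoids domain errors from rounding.
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation, distance


if __name__ == "__main__":
    # A source 2 m away, directly along the +y axis under the assumed
    # convention; x and z come out as values very close to zero.
    position = doa_distance_to_cartesian(90.0, 0.0, 2.0)
    print(position)
    print(cartesian_to_doa_distance(*position))
```

Under this sketch, a single three-component vector carries both pieces of information: its direction is the DOA and its norm is the source distance, so one regression target can cover both tasks.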

Code & Models

Repositories

Hong-Hengyi/MVANet-SELD (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Softmax · Attention Is All You Need