Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios

Ya Jiang; Qing Wang; Jun Du; Maocheng Hu; Pengfei Hu; Zeyan Liu; Shi Cheng; Zhaoxu Nian; Yuxuan Dong; Mingqi Cai; Xin Fang; Chin-Hui Lee

arXiv:2406.15160 · eess.AS · June 24, 2024 · ICME

Open Access

TL;DR

This paper introduces a novel audio-visual fusion framework for sound event localization and detection in low-resource scenarios, leveraging cross-modal learning, multi-stage fusion, and innovative augmentation techniques to improve performance.

Contribution

It proposes a cross-modal teacher-student learning framework, a two-stage fusion strategy, and a video pixel swapping augmentation, advancing SELD in low-resource settings.

Findings

1. Achieved top performance on DCASE 2023 Challenge dataset.
2. Significant improvements in SELD accuracy with proposed methods.
3. Model ensemble with integrated techniques ranked first in challenge.

Abstract

This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim at utilizing audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion to exploit synergies between audio and video modalities. Finally, we introduce an innovative video pixel swapping (VPS) technique to extend an audio channel swapping (ACS) method to an audio-visual joint…
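The idea of pairing audio channel swapping with video pixel swapping can be illustrated with a single left-right mirror transform. This is a minimal sketch of the general idea, not the paper's implementation: the abstract above is truncated and does not list the actual transforms. It assumes first-order Ambisonics audio with channels ordered W, Y, Z, X, an equirectangular video frame whose center column corresponds to azimuth 0, and a direction-of-arrival label in degrees. The function name mirror_left_right and its arguments are hypothetical.

```python
def mirror_left_right(foa, frame, doa_deg):
    """Apply one joint audio-visual mirror transform (left-right reflection).

    foa     : [W, Y, Z, X] channels, each a list of samples (assumed ordering)
    frame   : video frame as a list of pixel rows (assumed equirectangular,
              with azimuth 0 at the center column)
    doa_deg : (azimuth, elevation) label in degrees
    """
    w, y, z, x = foa
    # Audio: Y is proportional to sin(azimuth) * cos(elevation), so mapping
    # azimuth to -azimuth negates Y and leaves W, Z, and X unchanged.
    foa_out = [list(w), [-s for s in y], list(z), list(x)]
    # Video: the same reflection is a horizontal flip of every pixel row.
    frame_out = [row[::-1] for row in frame]
    # Label: azimuth changes sign, elevation is unchanged.
    azimuth, elevation = doa_deg
    return foa_out, frame_out, (-azimuth, elevation)


# Toy inputs: 2 audio samples per channel, a 2x3 frame, one source label.
foa = [[1.0, 2.0], [0.5, -0.5], [0.1, 0.2], [0.3, 0.4]]
frame = [[1, 2, 3], [4, 5, 6]]
aug_foa, aug_frame, aug_doa = mirror_left_right(foa, frame, (30, 10))
```

Applying the same geometric transform to the audio, the video, and the label keeps the three consistent, which is what allows an audio-only augmentation to be reused on audio-visual training data.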

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Sparse Evolutionary Training