Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos

Davide Berghi; Philip J. B. Jackson

arXiv:2507.04845·eess.AS·July 8, 2025

TL;DR

This paper introduces a multimodal approach to stereo sound event localization and detection (SELD) in regular video content. It augments standard SELD architectures with language-aligned semantic embeddings (CLAP for audio, OWL-ViT for visual input) and with autocorrelation features.

Contribution

The authors develop a Cross-Modal Conformer architecture that fuses audio, visual, and semantic embeddings, and they use pre-training and data augmentation to further improve SELD performance.
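This summary does not describe the internal layout of the Cross-Modal Conformer, so the following is only a generic illustration of one common way to fuse two feature streams: scaled dot-product cross-attention, where audio frames act as queries and external embedding tokens (for example, semantic or visual embeddings) act as keys and values. The function name `cross_attention_fuse`, the projection matrices, and all dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(audio_feats, context_feats, w_q, w_k, w_v):
    """Generic cross-attention with a residual connection (illustrative only).

    audio_feats:   (T, d)      one feature vector per audio frame
    context_feats: (N, d_ctx)  N embedding tokens from another source
    w_q: (d, d_att), w_k: (d_ctx, d_att), w_v: (d_ctx, d)
    Returns the fused features (T, d) and the attention weights (T, N).
    """
    q = audio_feats @ w_q                      # queries from the audio stream
    k = context_feats @ w_k                    # keys from the other stream
    v = context_feats @ w_v                    # values from the other stream
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    weights = softmax(scores, axis=-1)         # each audio frame attends over N tokens
    fused = audio_feats + weights @ v          # residual keeps the audio path intact
    return fused, weights

# Hypothetical sizes: 50 audio frames, 4 context tokens.
rng = np.random.default_rng(0)
T, N, d, d_ctx, d_att = 50, 4, 64, 512, 32
audio = rng.standard_normal((T, d))
context = rng.standard_normal((N, d_ctx))
w_q = rng.standard_normal((d, d_att)) * 0.1
w_k = rng.standard_normal((d_ctx, d_att)) * 0.1
w_v = rng.standard_normal((d_ctx, d)) * 0.1
fused, weights = cross_attention_fuse(audio, context, w_q, w_k, w_v)
```

The residual connection means the audio features pass through unchanged when the value projection contributes nothing, which is one reason this pattern is often used to add an auxiliary stream to an existing backbone.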

Findings

01 Significant performance improvement over baseline models.

02 Effective integration of semantic and visual information.

03 Enhanced distance estimation with autocorrelation features.
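The summary does not say how the autocorrelation features are defined, so the block below is a minimal sketch of a generic short-time autocorrelation computed per channel of a stereo frame, not the authors' feature extractor. It uses the FFT (Wiener-Khinchin relation) and normalizes by the zero-lag value. The function names, frame length, and `max_lag` are assumptions made for illustration. The toy signal shows one property that makes such features informative about the acoustic path: a delayed copy of the source (an echo) produces a peak at the corresponding lag.

```python
import numpy as np

def frame_autocorrelation(frame, max_lag):
    """Normalized autocorrelation of a 1-D frame for lags 0..max_lag."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()               # remove DC offset
    n = len(frame)
    # Zero-pad to a power of two >= 2n - 1 so the circular
    # correlation from the FFT equals the linear one.
    n_fft = 1 << (2 * n - 1).bit_length()
    spectrum = np.fft.rfft(frame, n_fft)
    acf = np.fft.irfft(np.abs(spectrum) ** 2, n_fft)[: max_lag + 1]
    if acf[0] > 0:
        acf = acf / acf[0]                     # lag 0 becomes 1.0
    return acf

def stereo_autocorrelation(stereo_frame, max_lag):
    """Per-channel autocorrelation: input (2, n) -> output (2, max_lag + 1)."""
    return np.stack([frame_autocorrelation(ch, max_lag) for ch in stereo_frame])

# Toy example: white noise plus one echo per channel (hypothetical delays).
rng = np.random.default_rng(0)
source = rng.standard_normal(4096)
left = source.copy()
left[40:] += 0.5 * source[:-40]                # echo delayed by 40 samples
right = source.copy()
right[25:] += 0.5 * source[:-25]               # echo delayed by 25 samples

features = stereo_autocorrelation(np.stack([left, right]), max_lag=64)
strongest_lag_left = int(np.argmax(features[0, 1:])) + 1
strongest_lag_right = int(np.argmax(features[1, 1:])) + 1
```

In this toy case the strongest non-zero lag in each channel coincides with the echo delay that was injected. A real system would compute such features over sliding frames and feed them to the network alongside the spectral inputs.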

Abstract

This report presents our systems submitted to the audio-only and audio-visual tracks of the DCASE2025 Task 3 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD architectures rely on multichannel input, which limits their ability to leverage large-scale pre-training due to data constraints. To address this, we enhance standard SELD architectures with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal…
