Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization

Ling Xing; Hongyu Qu; Rui Yan; Xiangbo Shu; Jinhui Tang

arXiv:2409.07967 · cs.CV · May 12, 2025

Open Access

TL;DR

This paper introduces LoCo, a novel framework for dense audio-visual event localization that leverages local temporal coherence to improve cross-modal alignment and event boundary detection.

Contribution

LoCo employs local correspondence feature modulation and adaptive cross-modal interaction to enhance shared semantics and focus attention on relevant event boundaries.

Findings

1. Outperforms existing DAVE methods in localization accuracy
2. Effectively filters irrelevant cross-modal signals
3. Improves focus on local event boundaries

Abstract

Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that are both audible and visible in a long video, where events may co-occur and exhibit varying durations. However, complex audio-visual scenes often involve asynchronization between modalities, making accurate localization challenging. Existing DAVE solutions extract audio and visual features through unimodal encoders, and fuse them via dense cross-modal interaction. However, independent unimodal encoding struggles to emphasize shared semantics between modalities without cross-modal guidance, while dense cross-modal attention may over-attend to semantically unrelated audio-visual features. To address these problems, we present LoCo, a Locality-aware cross-modal Correspondence learning framework for DAVE. LoCo leverages the local temporal continuity of audio-visual…
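
The contrast the abstract draws between dense cross-modal attention and locality-aware interaction can be illustrated with a small sketch. This is not LoCo's actual mechanism, which the truncated abstract does not specify; it only shows, under stated assumptions, how restricting each time step to a temporal neighborhood keeps distant and possibly unrelated features from receiving any attention weight. The function names, the fixed `window` parameter, and the toy feature vectors are all hypothetical, and the two modalities are assumed to be temporally aligned with the same number of steps.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values, window=None):
    """Attend from one modality (queries) to the other (keys/values).

    queries, keys, values: lists of feature vectors, one per time step.
    window: None means dense attention over all steps (the baseline the
    abstract criticizes). An integer means a query at step t only sees
    keys at steps s with |t - s| <= window. A fixed window is an
    assumption of this sketch, not something taken from the paper.
    Returns (outputs, weights).
    """
    scale = math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for t, q in enumerate(queries):
        scores = []
        for s, k in enumerate(keys):
            if window is not None and abs(t - s) > window:
                scores.append(float("-inf"))  # outside the local window
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / scale)
        w = softmax(scores)
        out_dim = len(values[0])
        out = [sum(w[s] * values[s][d] for s in range(len(values)))
               for d in range(out_dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy, made-up features: 4 time steps, 2 dimensions per modality.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
audio = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

_, dense_w = cross_modal_attention(visual, audio, audio)
_, local_w = cross_modal_attention(visual, audio, audio, window=1)
```

With `window=1`, the first visual step places zero weight on audio steps 2 and 3, whereas the dense variant spreads some weight over every step. The abstract describes the interaction as adaptive, so a learned or data-dependent neighborhood would be closer in spirit than the fixed window used here.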

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Subtitles and Audiovisual Media

Methods: Softmax · Attention Is All You Need · Focus · Lipschitz Constant Constraint