CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization

Jinxing Zhou; Ziheng Zhou; Yanghao Zhou; Yuxin Mao; Zhangling Duan; Dan Guo

arXiv:2508.04566 · cs.CV · August 7, 2025

TL;DR

This paper introduces CLASP, a weakly-supervised method for dense audio-visual event localization that identifies cross-modal salient anchors and propagates their event semantics over time to improve temporal localization accuracy.

Contribution

The paper proposes a framework that uses cross-modal salient anchors and semantic propagation for weakly-supervised dense audio-visual event localization, and reports state-of-the-art results.

Findings

01 Achieves state-of-the-art performance on the UnAV-100 and ActivityNet1.3 datasets.

02 Effectively identifies reliable cross-modal temporal anchors under weak supervision.

03 Enhances event semantic encoding through anchor-based temporal propagation.

Abstract

The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting cross-modal salient anchors, which are defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across audio and visual modalities. Specifically, we propose a Mutual Event Agreement Evaluation module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. Then, the agreement score is utilized in a Cross-modal Salient Anchor…
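As a rough illustration of the agreement-score idea described in the abstract, the sketch below computes a per-timestamp score from audio and visual class probabilities. The abstract does not state which discrepancy measure the paper uses, so the mean absolute difference here is an assumption, as are the function names `agreement_scores` and `candidate_anchors` and the threshold-based selection. It should be read as a minimal sketch under those assumptions, not as the authors' implementation.

```python
from typing import List


def agreement_scores(audio_probs: List[List[float]],
                     visual_probs: List[List[float]]) -> List[float]:
    """Return one agreement score in [0, 1] per timestamp.

    Inputs are per-timestamp class probabilities, shape [T][C].
    Assumption: discrepancy is the mean absolute difference between the
    audio and visual class probabilities (hypothetical choice; the
    paper's actual measure is not given in the abstract).
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("audio and visual sequences must have equal length")
    scores = []
    for a_t, v_t in zip(audio_probs, visual_probs):
        if not a_t or len(a_t) != len(v_t):
            raise ValueError("each timestamp needs the same non-empty class set")
        discrepancy = sum(abs(a - v) for a, v in zip(a_t, v_t)) / len(a_t)
        scores.append(1.0 - discrepancy)  # low discrepancy -> high agreement
    return scores


def candidate_anchors(scores: List[float], threshold: float = 0.85) -> List[int]:
    """Indices of timestamps whose agreement meets a hypothetical threshold."""
    return [t for t, s in enumerate(scores) if s >= threshold]


# Toy example: 3 timestamps, 2 event classes.
audio = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
visual = [[0.9, 0.1], [0.2, 0.2], [0.1, 0.9]]
scores = agreement_scores(audio, visual)
anchors = candidate_anchors(scores)
```

In this toy input the two modalities disagree most at the second timestamp, so it receives the lowest score and is not selected as a candidate anchor. A divergence between the two probability distributions could be substituted for the mean absolute difference without changing the overall structure.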
