ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization

Huilai Li; Yonghao Dang; Ying Xing; Yiming Wang; Jianqin Yin

arXiv:2507.09945 · cs.MM · October 16, 2025

Open Access

TL;DR

ESG-Net introduces a multi-stage semantic guidance and multi-event dependency modeling approach to improve dense audio-visual event localization, achieving superior accuracy with fewer parameters.

Contribution

The paper proposes ESG-Net, which combines hierarchical semantic understanding with adaptive event dependency extraction to address the cross-modal semantic gap and the lack of event correlation modeling in dense audio-visual event localization (DAVE).

Findings

01. Outperforms state-of-the-art methods on benchmark datasets.

02. Reduces model parameters and computational load significantly.

03. Enhances hierarchical semantic understanding of audio-visual events.

Abstract

Dense audio-visual event localization (DAVE) aims to identify event categories and locate their temporal boundaries in untrimmed videos. Most studies apply event-related semantic constraints only to the final outputs and lack cross-modal semantic bridging in intermediate layers. This causes a modality semantic gap for subsequent fusion, making it difficult to distinguish event-related content from irrelevant background content. Moreover, they rarely consider the correlations between events, which limits the model's ability to infer concurrent events in complex scenarios. In this paper, we incorporate multi-stage semantic guidance and multi-event relationship modeling, which respectively enable hierarchical semantic understanding of audio-visual events and adaptive extraction of event dependencies, thereby better focusing on event-related information. Specifically, our event-aware semantic…
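
To make the task definition concrete, the sketch below shows one simple way dense, per-class scores over time could be turned into labeled segments with temporal boundaries, including concurrent events. It is only an illustration of the DAVE output format, not the ESG-Net method: the function name `scores_to_segments`, the 0.5 threshold, the one-second step, and the example labels and scores are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (event label, start time in seconds, end time in seconds)
Segment = Tuple[str, float, float]


def scores_to_segments(
    scores: Dict[str, List[float]],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
    step: float = 1.0,       # assumed duration of one time step in seconds
) -> List[Segment]:
    """Turn per-step, per-class scores into (label, start, end) segments.

    Each class is thresholded independently, so segments from different
    classes may overlap in time (concurrent events).
    """
    segments: List[Segment] = []
    for label, class_scores in scores.items():
        start: Optional[int] = None  # index where the current active run began
        for i, score in enumerate(class_scores):
            active = score >= threshold
            if active and start is None:
                start = i
            elif not active and start is not None:
                segments.append((label, start * step, i * step))
                start = None
        if start is not None:  # the run extends to the end of the clip
            segments.append((label, start * step, len(class_scores) * step))
    # Sort by start time, then end time, then label for a stable ordering.
    return sorted(segments, key=lambda s: (s[1], s[2], s[0]))


# Hypothetical scores for a 6-second clip with two overlapping events.
example = {
    "dog_bark": [0.1, 0.8, 0.9, 0.7, 0.2, 0.1],
    "speech": [0.0, 0.1, 0.6, 0.9, 0.8, 0.9],
}
print(scores_to_segments(example))
# prints [('dog_bark', 1.0, 4.0), ('speech', 2.0, 6.0)]
```

In this toy clip the two events overlap between 2 s and 4 s, which is the kind of concurrent-event case the abstract refers to.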

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Digital Media Forensic Detection