DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding

Mona Ahmadian; Amir Shirian; Frank Guerin; Andrew Gilbert

arXiv:2506.23196·cs.CV·July 1, 2025

Open Access

TL;DR

DEL is a novel dense event localization framework that improves multi-modal audio-visual understanding by accurately detecting multiple actions with fine-grained temporal resolution in complex videos.

Contribution

DEL introduces a dual-module approach: masked self-attention enforces intra-mode consistency within the audio and visual streams, and a multi-scale refinement module models cross-modal dependencies, achieving state-of-the-art results.

Findings

01

Achieves new state-of-the-art performance on multiple TAL datasets.

02

Surpasses previous methods with significant average mAP gains.

03

Effectively models complex temporal dependencies in untrimmed videos.

Abstract

Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization that aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: an audio-visual feature alignment module that leverages masked self-attention to enhance intra-mode consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, capturing both high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets (UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100), surpassing previous approaches with notable average mAP gains of…
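
The abstract names masked self-attention as the mechanism for intra-mode consistency but does not spell out its form here. As a rough illustration only, the sketch below shows generic single-head scaled dot-product self-attention over one modality's feature sequence, with a boolean mask restricting which time steps may attend to each other. The function names, the local-window mask, and the omission of learned query/key/value projections are all assumptions made for this example and are not taken from the paper.

```python
import math


def masked_self_attention(features, mask):
    """Single-head scaled dot-product self-attention with a boolean mask.

    features: list of T vectors (each a list of d floats) from ONE modality.
    mask[i][j] is True when time step i may attend to time step j.
    Queries, keys and values are the raw features (no learned projections),
    a simplification assumed for illustration.
    """
    d = len(features[0])
    output = []
    for i, query in enumerate(features):
        # Scaled dot-product scores; masked-out positions get -inf.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            if mask[i][j] else float("-inf")
            for j, key in enumerate(features)
        ]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        output.append([
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(d)
        ])
    return output


def local_window_mask(length, radius):
    """Hypothetical mask: each step sees neighbours within `radius` steps."""
    return [[abs(i - j) <= radius for j in range(length)]
            for i in range(length)]


# Toy "audio" sequence: 4 time steps, 2-dimensional features.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
mask = local_window_mask(len(audio), radius=1)
attended = masked_self_attention(audio, mask)
```

With `radius=0` every step attends only to itself and the output equals the input; widening the window lets each step blend in more temporal context from the same modality. A cross-modal variant would take queries from one stream and keys/values from the other, but how DEL does this across scales is not described in the text above.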

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis