DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding
Mona Ahmadian, Amir Shirian, Frank Guerin, Andrew Gilbert

TL;DR
DEL is a novel dense event localization framework that improves multi-modal audio-visual understanding by accurately detecting multiple actions with fine-grained temporal resolution in complex videos.
Contribution
DEL introduces two modules: an audio-visual alignment module that uses masked self-attention to enhance intra-mode consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales.
Findings
Achieves new state-of-the-art performance on four TAL datasets: UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100.
Surpasses previous methods with significant average mAP gains.
Effectively models complex temporal dependencies in untrimmed videos.
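For context on the metric named above: in temporal action localization, mAP is conventionally computed by matching predicted segments to ground-truth segments at one or more temporal IoU (tIoU) thresholds, and "average mAP" is the mean over those thresholds. The sketch below illustrates only the tIoU matching step. The function names and the example threshold are illustrative assumptions, not taken from the paper or its evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    # Overlap length is clamped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold):
    """Hypothetical helper: a prediction matches a ground-truth
    segment when their tIoU reaches the chosen threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    # Illustrative segments only; these are not results from the paper.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(is_match((0.0, 10.0), (5.0, 15.0), threshold=0.5))
```

A full mAP computation would additionally rank predictions by confidence and build a precision-recall curve per class, which this sketch omits.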
Abstract
Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization that aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: an audio-visual feature alignment module that leverages masked self-attention to enhance intra-mode consistency, and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, capturing both high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets (UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100), surpassing previous approaches with notable average mAP gains of…
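As a rough illustration of what masked self-attention over a single modality's feature sequence can look like, here is a minimal pure-Python sketch. It is not DEL's implementation: the local-window mask, the absence of learned query/key/value projections, and all function names are assumptions made for this example, since the abstract does not specify the masking scheme.

```python
import math


def local_window_mask(length, window):
    """Assumed masking pattern: position i may attend to position j
    only when |i - j| <= window. The diagonal is always allowed."""
    return [[abs(i - j) <= window for j in range(length)]
            for i in range(length)]


def masked_self_attention(x, mask):
    """Scaled dot-product self-attention over a sequence x (a list of
    equal-length float vectors), with masked pairs excluded.
    Inputs serve directly as queries, keys, and values (no learned
    projections); each mask row must allow at least one position."""
    dim = len(x[0])
    out = []
    for i, query in enumerate(x):
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if mask[i][j] else float("-inf")
            for j, key in enumerate(x)
        ]
        # Numerically stable softmax; masked entries get weight 0.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[c] for w, v in zip(weights, x))
                    for c in range(dim)])
    return out


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(masked_self_attention(feats, local_window_mask(3, 1)))
```

With a window of 0 each position attends only to itself, so the output equals the input; widening the window lets each time step mix in features from its temporal neighbours within the same modality.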
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
