Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration

Ziheng Zhou; Jinxing Zhou; Wei Qian; Shengeng Tang; Xiaojun Chang; Dan Guo

arXiv:2412.12628·cs.CV·December 19, 2024

TL;DR

This paper introduces CCNet, a novel framework for dense audio-visual event localization in long videos, leveraging cross-modal consistency and multi-temporal features to improve scene understanding and achieve state-of-the-art results.

Contribution

The paper proposes a new CCNet model with cross-modal and multi-temporal modules, advancing dense event localization in untrimmed videos with overlapping and varied-duration events.

Findings

01. Achieves state-of-the-art performance on the UnAV-100 dataset.

02. Effectively models cross-modal relations and temporal features.

03. Demonstrates robustness in dense, overlapping event scenarios.

Abstract

In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. This task seeks to identify and temporally pinpoint all events simultaneously occurring in both audio and visual streams. Typically, each video encompasses dense events of multiple classes, which may overlap on the timeline, each exhibiting varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To address these challenges, we introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration (CMCC) and the Multi-Temporal Granularity Collaboration (MTGC). Specifically, the CMCC module contains…
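To make the task setup in the abstract concrete, here is a minimal sketch of how dense, overlapping events of varied duration might be represented, together with temporal IoU, a common overlap measure in temporal localization. The `Event` class, the helper names, and the example annotations are all hypothetical illustrations; none of them come from the paper or its repository.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Event:
    """One audio-visual event segment (hypothetical structure, not from the paper)."""
    label: str
    start: float  # seconds
    end: float    # seconds


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection-over-union of two segments on the timeline."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def overlapping_pairs(events):
    """Return pairs of events whose time spans intersect."""
    return [(a, b) for a, b in combinations(events, 2) if temporal_iou(a, b) > 0.0]


# Made-up annotations for one untrimmed video: several classes,
# different durations, some overlapping on the timeline.
video_events = [
    Event("dog barking", 2.0, 6.0),
    Event("man speaking", 4.0, 20.0),
    Event("car passing", 25.0, 28.0),
]

for a, b in overlapping_pairs(video_events):
    print(f"{a.label} overlaps {b.label}: tIoU = {temporal_iou(a, b):.3f}")
```

In this toy example only the first two events overlap, which is the kind of case a dense localization model has to separate by class while still recovering each segment's own boundaries.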

Code & Models

Repositories

zzhhfut/ccnet-aaai2025
PyTorch · Official

Videos

Dense Audio-Visual Event Localization Under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration · Underline

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies

Methods: Criss-Cross Network · Focus