AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization

Tanvir Mahmud; Diana Marculescu

arXiv:2210.05060·cs.CV·October 12, 2022

TL;DR

This paper introduces AVE-CLIP, a novel multi-modal framework combining AudioCLIP and multi-window temporal transformers to improve audio-visual event localization across multiple temporal scales, achieving state-of-the-art results.

Contribution

The paper presents a multi-stage training framework, a multi-domain attention mechanism, and a temporal refining scheme, advancing multi-modal event localization techniques.

Findings

01. Achieves a 5.9% mean accuracy improvement on the AVE dataset.

02. Outperforms existing approaches in audio-visual event localization.

03. Effectively captures multi-scale temporal interactions.

Abstract

An audio-visual event (AVE) is denoted by the correspondence of the visual and auditory signals in a video segment. Precise localization of the AVEs is very challenging since it demands effective multi-modal feature correspondence to ground the short and long range temporal interactions. Existing approaches struggle in capturing the different scales of multi-modal interaction due to ineffective multi-modal training strategies. To overcome this limitation, we introduce AVE-CLIP, a novel framework that integrates the AudioCLIP pre-trained on large-scale audio-visual data with a multi-window temporal transformer to effectively operate on different temporal scales of video frames. Our contributions are three-fold: (1) We introduce a multi-stage training framework to incorporate AudioCLIP pre-trained with audio-image pairs into the AVE localization task on video frames through contrastive…
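The idea of attending over several temporal scales can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a clip already encoded as a list of per-segment feature vectors, uses plain dot-product self-attention without learned projections, and fuses the scales by simple averaging. The function names and the window sizes (2, 5, 10) are hypothetical choices made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_self_attention(feats, window):
    """Dot-product self-attention restricted to non-overlapping windows.

    feats  : list of T feature vectors (each a list of floats)
    window : number of consecutive segments that may attend to each other
    """
    dim = len(feats[0])
    out = []
    for start in range(0, len(feats), window):
        chunk = feats[start:start + window]
        for query in chunk:
            # Scaled similarity of this segment to every segment in its window.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in chunk
            ]
            weights = softmax(scores)
            # Weighted sum of the window's features (values = inputs here).
            out.append([
                sum(w * value[d] for w, value in zip(weights, chunk))
                for d in range(dim)
            ])
    return out


def multi_window_attention(feats, windows=(2, 5, 10)):
    """Run windowed attention at several scales and average the results."""
    per_scale = [window_self_attention(feats, w) for w in windows]
    dim = len(feats[0])
    return [
        [sum(scale[t][d] for scale in per_scale) / len(windows) for d in range(dim)]
        for t in range(len(feats))
    ]


# Hypothetical input: 10 segments with 2-dimensional features.
features = [[float(t), 1.0] for t in range(10)]
fused = multi_window_attention(features)
```

Small windows limit each segment to its immediate neighbours (short-range interactions), while a window covering the whole clip lets every segment attend to every other (long-range interactions). A real model would add learned query, key, and value projections and a learned fusion step in place of the average.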

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization