Audio-Visual Event Localization in Unconstrained Videos
Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu

TL;DR
This paper introduces a new problem of localizing events in videos using both audio and visual cues, presents a new dataset, and proposes models to improve multimodal event detection and localization.
Contribution
It defines the novel problem of audio-visual event localization in unconstrained videos, introduces the AVE dataset, and develops models for multimodal fusion and cross-modality localization.
Findings
Joint audio-visual modeling outperforms independent modalities.
Attention mechanisms capture semantics of sounding objects.
Temporal alignment enhances audio-visual fusion.
Abstract
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal…
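The abstract names an audio-guided visual attention mechanism but gives no equations, so the following is only a minimal sketch of the general idea: the audio feature of a segment is used to score each spatial region of the visual feature map, and the regions are then pooled with those weights. The function names, the scaled dot-product scoring, and the toy feature vectors are all assumptions made for illustration; they are not taken from the paper, which may use a learned scoring function instead.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(audio_feat, region_feats):
    """Pool visual regions using weights derived from the audio feature.

    Assumption: scaled dot-product similarity stands in for whatever
    scoring function the paper actually learns.
    """
    dim = len(audio_feat)
    # One relevance score per visual region, guided by the audio feature.
    scores = [
        sum(a * v for a, v in zip(audio_feat, region)) / math.sqrt(dim)
        for region in region_feats
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the region features.
    attended = [
        sum(w * region[d] for w, region in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, attended


# Hypothetical 2-D features: one audio vector and three visual regions.
audio = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, attended = audio_guided_attention(audio, regions)
```

In this toy setup the first region is the one most aligned with the audio vector, so it receives the largest weight. This mirrors, in a very simplified way, the reported finding that the learned attention tends to focus on sounding objects.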
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
