Audio-Visual Event Localization in Unconstrained Videos

Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu

arXiv:1803.08842 · cs.CV · March 26, 2018 · 28 cites

TL;DR

This paper introduces a new problem of localizing events in videos using both audio and visual cues, presents a new dataset, and proposes models to improve multimodal event detection and localization.

Contribution

It defines the novel problem of audio-visual event localization in unconstrained videos, introduces the AVE dataset, and develops models for multimodal fusion and cross-modality localization.

Findings

01 Joint audio-visual modeling outperforms modeling each modality independently.

02 The learned attention captures the semantics of sounding objects.

03 Temporal alignment enhances audio-visual fusion.

Abstract

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal…
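To make the idea of audio-guided visual attention more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the additive (tanh) scoring form, the tiny feature dimensions, the random weights, and all names (`audio_guided_attention`, `W_v`, `W_a`, `w_f`) are assumptions chosen for illustration. It only shows the general pattern in which an audio feature for a segment scores each spatial region of the visual feature map, and the softmax-normalized scores pool the regions into one attended visual vector.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Row vector of length n times an (n x m) matrix, giving length m."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


def audio_guided_attention(regions, audio, W_v, W_a, w_f):
    """Pool visual regions using weights conditioned on the audio feature.

    regions: list of k visual region vectors, each of length d_v
    audio:   audio feature vector of length d_a
    W_v:     (d_v x d) projection for regions   (hypothetical parameter)
    W_a:     (d_a x d) projection for audio     (hypothetical parameter)
    w_f:     length-d scoring vector            (hypothetical parameter)
    """
    audio_proj = matvec(audio, W_a)
    scores = []
    for region in regions:
        region_proj = matvec(region, W_v)
        # Assumed additive scoring: combine both projections, then squash.
        hidden = [math.tanh(r + a) for r, a in zip(region_proj, audio_proj)]
        scores.append(sum(h * w for h, w in zip(hidden, w_f)))
    weights = softmax(scores)
    dim = len(regions[0])
    # Weighted sum of the region vectors.
    attended = [sum(w * region[i] for w, region in zip(weights, regions))
                for i in range(dim)]
    return attended, weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, assumed for illustration only.
rng = random.Random(0)
k, d_v, d_a, d = 4, 6, 3, 5
regions = random_matrix(k, d_v, rng)
audio = [rng.gauss(0.0, 1.0) for _ in range(d_a)]
W_v = random_matrix(d_v, d, rng)
W_a = random_matrix(d_a, d, rng)
w_f = [rng.gauss(0.0, 0.1) for _ in range(d)]

attended, weights = audio_guided_attention(regions, audio, W_v, W_a, w_f)
print("attention weights:", [round(w, 3) for w in weights])
```

In a trained model the projections would be learned, so that regions containing the sounding object receive higher weights; with random weights as above, the output only demonstrates the shapes and the normalization.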

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization