Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Andrew Owens; Alexei A. Efros

arXiv:1804.03641·cs.CV·October 10, 2018

TL;DR

This paper introduces a self-supervised method to learn fused audio-visual representations for analyzing scenes, enabling tasks like sound source localization, action recognition, and source separation.

Contribution

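The paper proposes learning a fused audio-visual representation without labels, by training a neural network to predict whether a video's frames and audio track are temporally aligned. The learned representation is then reused for downstream multisensory tasks.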

Findings

01 Effective sound source localization in videos

02 Improved audio-visual action recognition accuracy

03 Successful off-screen audio source separation

Abstract

The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator's voice from a foreign official's speech. Code, models, and video results are available on our webpage:…
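
The alignment pretext task described in the abstract can be illustrated with a minimal sketch of how training pairs might be built: one clip paired with its own audio (label 1) and the same clip paired with audio taken from a shifted position in the same video (label 0). This is not the authors' code. The function name `make_alignment_pair`, its parameters, and the shift size are hypothetical, and the sketch assumes integer frame and sample rates; the network that consumes the pairs is not shown.

```python
import numpy as np


def make_alignment_pair(frames, audio, fps, sample_rate,
                        clip_frames, shift_samples, rng):
    """Build one aligned and one misaligned (frames, audio, label) example.

    Hypothetical helper: frames is (T, H, W, C), audio is a 1-D waveform,
    and both are assumed to start at the same instant.
    """
    # Number of audio samples spanning the same duration as the clip.
    clip_samples = clip_frames * sample_rate // fps
    # Latest audio start that still leaves room for the shifted window.
    last_audio_start = len(audio) - clip_samples - shift_samples
    max_start_frame = min(len(frames) - clip_frames,
                          last_audio_start * fps // sample_rate)
    if last_audio_start < 0 or max_start_frame < 0:
        raise ValueError("video is too short for this clip length and shift")

    start_frame = int(rng.integers(0, max_start_frame + 1))
    start_sample = start_frame * sample_rate // fps

    clip = frames[start_frame:start_frame + clip_frames]
    # Positive example: audio covering the same time span as the frames.
    aligned = audio[start_sample:start_sample + clip_samples]
    # Negative example: audio from the same video, shifted later in time.
    shifted = audio[start_sample + shift_samples:
                    start_sample + shift_samples + clip_samples]
    return (clip, aligned, 1), (clip, shifted, 0)


# Toy data: 5 seconds of video at 10 fps with 1 kHz audio.
rng = np.random.default_rng(0)
frames = np.zeros((50, 4, 4, 3), dtype=np.float32)
audio = np.arange(5000, dtype=np.float32)
positive, negative = make_alignment_pair(
    frames, audio, fps=10, sample_rate=1000,
    clip_frames=20, shift_samples=1500, rng=rng)
```

Drawing the negative from the same video, rather than from a different one, means the two examples differ only in timing, so a classifier trained on such pairs cannot succeed just by recognizing which audio belongs to which scene.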

Code & Models

Repositories

andrewowens/multisensory (TensorFlow)