Look, Listen and Learn

Relja Arandjelović, Andrew Zisserman

arXiv:1705.08168 · cs.CV · August 2, 2017 · 40 cites

TL;DR

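This paper introduces the Audio-Visual Correspondence task, a self-supervised objective that learns visual and audio representations from unlabelled videos by exploiting the natural correspondence between the two streams. The learned features set a new state of the art on two sound classification benchmarks and perform on par with state-of-the-art self-supervised approaches on ImageNet classification.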

Contribution

The paper proposes the Audio-Visual Correspondence task, which trains visual and audio networks from scratch on raw, unlabelled videos, with no supervision beyond the videos themselves.
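The summary here does not spell out how training examples are built, so the following is only a minimal sketch of one plausible setup for a correspondence task. It assumes a positive pair is a frame and an audio clip taken from the same video at the same time, and a negative pair mixes a frame with audio from a different video. The `videos` structure, the function name `sample_avc_pair`, and the 50/50 positive/negative split are all hypothetical choices made for illustration.

```python
import random


def sample_avc_pair(videos, rng):
    """Return (frame, audio_clip, label) for one correspondence example.

    `videos` is a hypothetical list of dicts, each holding time-aligned
    lists under "frames" and "audio" (audio[i] overlaps frames[i] in time).
    label is 1 when frame and audio correspond, 0 otherwise.
    """
    vid_a = rng.randrange(len(videos))
    t = rng.randrange(len(videos[vid_a]["frames"]))
    frame = videos[vid_a]["frames"][t]

    if rng.random() < 0.5:
        # Assumed positive: audio from the same video at the same time step.
        return frame, videos[vid_a]["audio"][t], 1

    # Assumed negative: audio drawn from a different video.
    others = [i for i in range(len(videos)) if i != vid_a]
    vid_b = rng.choice(others)
    t_b = rng.randrange(len(videos[vid_b]["audio"]))
    return frame, videos[vid_b]["audio"][t_b], 0
```

A binary classifier fed by a vision network and an audio network could then be trained on these labels. No manual annotation is involved, since the label comes entirely from how the pair was sampled.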

Findings

01 Achieved state-of-the-art on two sound classification benchmarks.

02 Performed on par with top self-supervised methods on ImageNet.

03 Demonstrated object localization and fine-grained recognition in both modalities.

Abstract

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as…

Code & Models

Repositories

marl/l3embedding (TensorFlow)

Videos

DeepMind's AI Learns Audio And Video Concepts By Itself | Two Minute Papers #184 · YouTube

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection