Look, Listen and Learn
Relja Arandjelović, Andrew Zisserman

TL;DR
This paper introduces a self-supervised learning task that exploits the natural correspondence between the visual and audio streams of unlabelled videos to learn representations for both modalities. The learned features set the state of the art on two sound classification benchmarks and perform on par with state-of-the-art self-supervised approaches on ImageNet classification.
Contribution
The paper proposes the Audio-Visual Correspondence task, in which visual and audio networks are trained from scratch on raw, unconstrained videos with no supervision beyond the videos themselves.
Findings
Achieved state-of-the-art on two sound classification benchmarks.
Performed on par with top self-supervised methods on ImageNet.
Demonstrated object localization and fine-grained recognition in both modalities.
Abstract
We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as…
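As a rough illustration of how training examples for a correspondence task of this kind could be assembled, the sketch below pairs a frame with an audio snippet and labels the pair 1 if they correspond and 0 if they do not. The abstract does not spell out the sampling scheme, so the choices here are assumptions: positives take frame and audio from the same clip, negatives take audio from a different video, and the two classes are balanced. The names `Clip` and `make_avc_pairs` are hypothetical, and the string fields stand in for real image and audio data.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical container for one short video segment."""
    video_id: str
    frame: str  # placeholder for a video frame
    audio: str  # placeholder for the audio snippet around that frame


def make_avc_pairs(clips, num_pairs, seed=0):
    """Build (frame, audio, label) triples for a correspondence task.

    Assumed scheme, not confirmed by the abstract:
    label 1 -> frame and audio come from the same clip,
    label 0 -> audio comes from a clip of a different video.
    """
    if len({c.video_id for c in clips}) < 2:
        raise ValueError("need clips from at least two videos for negatives")

    rng = random.Random(seed)
    pairs = []
    for i in range(num_pairs):
        anchor = rng.choice(clips)
        if i % 2 == 0:
            # Positive pair: matching frame and audio.
            pairs.append((anchor.frame, anchor.audio, 1))
        else:
            # Negative pair: audio drawn from a different video.
            others = [c for c in clips if c.video_id != anchor.video_id]
            pairs.append((anchor.frame, rng.choice(others).audio, 0))
    return pairs
```

In a full pipeline, each triple would be fed to a visual network and an audio network whose combined output is trained as a binary classifier on the label; that part is omitted here because the abstract gives no architectural detail.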
Videos
DeepMind's AI Learns Audio And Video Concepts By Itself | Two Minute Papers #184 · YouTube
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection
