Objects that Sound

Relja Arandjelović; Andrew Zisserman

arXiv:1712.06651 · cs.CV · July 27, 2018 · 2 cites

TL;DR

This paper presents a self-supervised approach that embeds audio and visual data into a shared space for cross-modal retrieval and localizes sound sources in images, trained only on unlabelled video.

Contribution

The paper introduces new network architectures for cross-modal embedding and sound source localization, trained with audio-visual correspondence (AVC) on unlabelled video.
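
To make the idea of training on audio-visual correspondence without labels more concrete, here is a minimal sketch of how training pairs could be drawn from unlabelled video. The page does not spell out the sampling rule, so the rule used here (positive: frame and audio from the same video; negative: frame and audio from two different videos) is an assumption, and `sample_avc_pairs` and its parameters are hypothetical names chosen for illustration.

```python
import random


def sample_avc_pairs(video_ids, num_pairs, seed=0):
    """Build (frame_video, audio_video, label) triples for an AVC-style task.

    Assumption (not taken from the source page): a pair is positive
    (label 1) when the frame and the audio come from the same video and
    negative (label 0) when they come from two different videos. The
    label is derived from the pairing itself, so no annotation is needed.
    """
    if len(video_ids) < 2:
        raise ValueError("need at least two videos to form negative pairs")
    rng = random.Random(seed)
    pairs = []
    for i in range(num_pairs):
        frame_video = rng.choice(video_ids)
        if i % 2 == 0:
            # Corresponding pair: audio is taken from the same video.
            audio_video = frame_video
            label = 1
        else:
            # Mismatched pair: audio is taken from any other video.
            others = [v for v in video_ids if v != frame_video]
            audio_video = rng.choice(others)
            label = 0
        pairs.append((frame_video, audio_video, label))
    return pairs


if __name__ == "__main__":
    for triple in sample_avc_pairs(["vid_a", "vid_b", "vid_c"], num_pairs=4):
        print(triple)
```

Alternating positives and negatives keeps the two classes balanced, which is a convenience of this sketch rather than something stated on the page.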

Findings

01. Audio-visual embeddings enable cross-modal retrieval.

02. Sound sources can be localized in images without motion cues.

03. Various architectures for the audio-visual correspondence (AVC) task improve understanding of sound-object relationships.

Abstract

In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video. To this end, we design new network architectures that can be trained for cross-modal retrieval and localizing the sound source in an image, by using the AVC task. We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single…
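
As an illustration of what retrieval in a common embedding space looks like once audio and visual embeddings exist, the sketch below ranks gallery items by cosine similarity to a query embedding. It is not the authors' implementation: the similarity measure, the function names `l2_normalize` and `retrieve`, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def retrieve(query, gallery, top_k=1):
    """Return the keys of the top_k gallery items most similar to the query.

    `query` is an embedding from one modality (for example audio) and
    `gallery` maps item names to embeddings from any modality, all assumed
    to live in the same shared space. Similarity is the cosine similarity,
    which is an assumption of this sketch.
    """
    q = l2_normalize(query)
    scored = []
    for key, emb in gallery.items():
        g = l2_normalize(emb)
        similarity = sum(a * b for a, b in zip(q, g))
        scored.append((similarity, key))
    # Highest similarity first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [key for _, key in scored[:top_k]]


if __name__ == "__main__":
    # Toy image embeddings (hypothetical values).
    images = {"guitar.jpg": [0.9, 0.1], "dog.jpg": [0.1, 0.9]}
    # Toy audio embedding that points in roughly the "guitar" direction.
    print(retrieve([1.0, 0.2], images))
```

The same function covers within-mode retrieval (audio-to-audio, image-to-image) as well, since it only assumes that query and gallery embeddings share one space.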

Videos

DeepMind's AI Learns Object Sounds | Two Minute Papers #224 · YouTube

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation