Objects that Sound
Relja Arandjelović, Andrew Zisserman

TL;DR
This paper presents a self-supervised approach, trained on unlabelled video, that embeds audio and visual data into a shared space for cross-modal retrieval and localizes sound sources in images.
Contribution
It introduces new network architectures for cross-modal embedding and sound source localization, trained on unlabelled video using audio-visual correspondence (AVC) as the only objective.
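To make the AVC objective concrete, here is a minimal sketch of how training pairs could be built from unlabelled video. It assumes that a frame and an audio clip taken from the same video count as corresponding, and that mismatched pairs come from different videos. The function name `make_avc_pairs`, the 50/50 split, and the use of video IDs in place of real frames and audio are illustrative assumptions, not details taken from the paper.

```python
import random

def make_avc_pairs(video_ids, num_pairs, seed=0):
    """Build (frame_video, audio_video, label) triples for a binary AVC task.

    Hypothetical helper: label 1 means the frame and audio come from the
    same video (corresponding), label 0 means they come from different
    videos. Requires at least two distinct video IDs.
    """
    rng = random.Random(seed)
    pairs = []
    for i in range(num_pairs):
        frame_video = rng.choice(video_ids)
        if i % 2 == 0:
            # Positive pair: audio taken from the same video as the frame.
            audio_video, label = frame_video, 1
        else:
            # Negative pair: audio taken from any other video.
            others = [v for v in video_ids if v != frame_video]
            audio_video, label = rng.choice(others), 0
        pairs.append((frame_video, audio_video, label))
    return pairs

pairs = make_avc_pairs(["vid_a", "vid_b", "vid_c"], num_pairs=10)
```

No human annotation enters this process: the label is derived entirely from whether the two streams share a source video, which is what makes the task self-supervised.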
Findings
Audio-visual embeddings enable cross-modal retrieval.
Localization of sound sources in images is achievable without motion cues.
The paper explores several architectures for the AVC task, including variants of the visual stream.
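As a rough illustration of what retrieval in a shared embedding space involves, the sketch below ranks gallery items by Euclidean distance to a query after L2 normalization. The embeddings are placeholder vectors rather than network outputs, and the names `l2_normalize` and `retrieve`, as well as the choice of distance, are assumptions made for this example. Because both modalities live in the same space, the same function covers audio-to-image, image-to-audio, and within-mode queries.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each vector along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def retrieve(query, gallery, k=3):
    """Return indices of the k gallery embeddings closest to the query.

    Hypothetical helper: `query` has shape (d,), `gallery` has shape (n, d).
    Both may come from either modality, since they share one space.
    """
    q = l2_normalize(np.asarray(query, dtype=float))
    g = l2_normalize(np.asarray(gallery, dtype=float))
    dists = np.linalg.norm(g - q, axis=1)  # Euclidean distance per gallery item
    return np.argsort(dists)[:k]

# Placeholder data: one audio query against three image embeddings.
image_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
audio_query = np.array([0.9, 0.1])
top = retrieve(audio_query, image_embeddings, k=2)
```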
Abstract
In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video. To this end, we design new network architectures that can be trained for cross-modal retrieval and localizing the sound source in an image, by using the AVC task. We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single…
Videos
DeepMind's AI Learns Object Sounds | Two Minute Papers #224 · youtube
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation
