Localizing Visual Sounds the Hard Way
Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

TL;DR
This paper presents a novel contrastive learning approach for localizing visible sound sources in videos without manual annotations, achieving state-of-the-art results and introducing a new large-scale dataset.
Contribution
It introduces a hard sample mining mechanism within contrastive learning for sound source localization and provides a new extensive video-based dataset with bounding box annotations.
Findings
Achieves state-of-the-art localization performance on Flickr SoundNet.
Introduces VGG-SS, a large-scale annotated video dataset for sound source localization.
Demonstrates effectiveness of hard sample mining in contrastive learning.
Abstract
The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous…
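To make the idea of mining hard samples inside a contrastive objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function name, the soft-threshold masks, and the values of `pos_thresh`, `neg_thresh` and `temp` are all illustrative assumptions. Given an audio embedding and the per-location visual features of its paired frame, locations with high audio-visual similarity are treated as the positive region, locations with low similarity in the same frame are treated as hard negatives, and all other frames in the batch serve as ordinary negatives.

```python
import math
import random


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / (sum(weights) + 1e-8)


def hard_negative_contrastive_loss(visual, audio,
                                   pos_thresh=0.65, neg_thresh=0.4, temp=0.03):
    """Hypothetical sketch of a contrastive loss with mined hard negatives.

    visual[i] is a list of per-location feature vectors for frame i;
    audio[i] is the embedding of the audio clip paired with frame i.
    Threshold and temperature defaults are assumptions for illustration.
    """
    visual = [[_normalize(loc) for loc in img] for img in visual]
    audio = [_normalize(a) for a in audio]
    losses = []
    for i, a in enumerate(audio):
        # Cosine similarity between audio i and every location of its own frame.
        own = [_dot(a, loc) for loc in visual[i]]
        # Soft masks: confident "sounding" locations vs. confident background.
        pos_w = [_sigmoid((s - pos_thresh) / temp) for s in own]
        neg_w = [1.0 - _sigmoid((s - neg_thresh) / temp) for s in own]
        pos = _weighted_mean(own, pos_w)
        hard_neg = _weighted_mean(own, neg_w)  # mined from the SAME frame
        # Ordinary negatives: mean similarity to every other frame in the batch.
        easy_negs = [sum(_dot(a, loc) for loc in img) / len(img)
                     for j, img in enumerate(visual) if j != i]
        logits = [pos / temp, hard_neg / temp] + [s / temp for s in easy_negs]
        # Numerically stable -log softmax of the positive logit.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[0])
    return sum(losses) / len(losses)


# Usage on random features: 4 frames, 9 locations each, 8 channels.
random.seed(0)
visual = [[[random.gauss(0, 1) for _ in range(8)] for _ in range(9)]
          for _ in range(4)]
audio = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
loss = hard_negative_contrastive_loss(visual, audio)
```

The key design point the sketch tries to capture is that the background of a frame that does contain the sounding object still contributes a negative term, so the model is pushed to separate the object from its surroundings rather than only from unrelated frames.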
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection
Methods: Contrastive Learning
