Localizing Visual Sounds the Hard Way

Honglie Chen; Weidi Xie; Triantafyllos Afouras; Arsha Nagrani; Andrea Vedaldi; Andrew Zisserman

arXiv:2104.02691 · cs.CV · April 7, 2021 · 1 citation

Open Access · 1 Repo

TL;DR

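This paper presents a contrastive learning approach with automatic hard sample mining for localizing visible sound sources in videos without manual annotations. It achieves state-of-the-art results on Flickr SoundNet and introduces the VGG-SS benchmark, a new set of bounding box annotations for the VGG-Sound dataset.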

Contribution

It introduces a mechanism that automatically mines hard samples and adds them to a contrastive learning formulation for sound source localization, and it provides VGG-SS, a new set of bounding box annotations for the VGG-Sound video dataset.

Findings

1. Achieves state-of-the-art localization performance on the Flickr SoundNet dataset.

2. Introduces VGG-SS, a large-scale annotated video dataset for sound source localization.

3. Demonstrates the effectiveness of hard sample mining in contrastive learning.

Abstract

The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous…
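
The idea in the abstract, treating challenging regions of the paired image itself as negatives in a contrastive loss, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository listed under Code & Models for that). The function names, the soft-mask thresholds `pos_thresh` and `neg_thresh`, and both temperature values are assumptions chosen for illustration only.

```python
import numpy as np


def cosine_similarity_maps(audio, visual, eps=1e-8):
    """audio: (B, C) embeddings; visual: (B, C, H, W) feature maps.

    Returns S of shape (B, B, H, W), where S[i, j] holds the cosine
    similarity between audio clip i and every location of image j.
    """
    a = audio / (np.linalg.norm(audio, axis=1, keepdims=True) + eps)
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + eps)
    return np.einsum("ic,jchw->ijhw", a, v)


def hard_negative_contrastive_loss(audio, visual, pos_thresh=0.6,
                                   neg_thresh=0.4, mask_temp=0.03,
                                   temperature=0.07,
                                   use_hard_negatives=True, eps=1e-8):
    """Contrastive loss with hard negatives mined from the paired image.

    All thresholds and temperatures are illustrative assumptions.
    """
    S = cosine_similarity_maps(audio, visual)
    B = S.shape[0]
    idx = np.arange(B)
    S_pair = S[idx, idx]  # (B, H, W): each audio clip vs. its own image

    # Soft masks: high-similarity locations count as the sounding object,
    # low-similarity locations of the same image count as hard negatives.
    pos_mask = 1.0 / (1.0 + np.exp(-(S_pair - pos_thresh) / mask_temp))
    neg_mask = 1.0 - 1.0 / (1.0 + np.exp(-(S_pair - neg_thresh) / mask_temp))

    pos = (pos_mask * S_pair).sum(axis=(1, 2)) / (pos_mask.sum(axis=(1, 2)) + eps)
    hard = (neg_mask * S_pair).sum(axis=(1, 2)) / (neg_mask.sum(axis=(1, 2)) + eps)
    if not use_hard_negatives:
        hard = np.full(B, -np.inf)  # drops the term from the denominator

    # Easy negatives: mean similarity with every other image in the batch.
    easy = S.mean(axis=(2, 3))  # (B, B)
    easy[idx, idx] = -np.inf    # exclude the paired image

    # Column 0 is the positive; the remaining columns are negatives.
    logits = np.concatenate([pos[:, None], hard[:, None], easy], axis=1)
    logits = logits / temperature

    # Numerically stable -log softmax of the positive column.
    m = logits.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(log_z - logits[:, 0]))
```

In this sketch, setting `use_hard_negatives=False` reduces the loss to a plain image-level contrastive objective, so comparing the two settings on the same batch shows the extra penalty contributed by the mined regions.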

Code & Models

Repositories

hche11/localizing-visual-sounds-the-hard-way
pytorch

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection

Methods: Contrastive Learning