Localizing Visual Sounds the Easy Way

Shentong Mo; Pedro Morgado

arXiv:2203.09324 · cs.CV · March 30, 2022

TL;DR

This paper introduces EZ-VSL, a simple unsupervised method for visual sound localization that aligns audio-visual representations without positive/negative region construction, achieving state-of-the-art results.

Contribution

The work proposes a simple approach to unsupervised visual sound localization that does not require constructing positive or negative (sounding or non-sounding) regions during training.

Findings

01

Achieves state-of-the-art performance on Flickr SoundNet and VGG-Sound Source datasets.

02

Improves CIoU from 76.80% to 83.94% on Flickr SoundNet (+7.14 percentage points).

03

Improves CIoU from 34.60% to 38.85% on VGG-Sound Source (+4.25 percentage points).

Abstract

Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarities for likely positive (sounding) regions and low similarities for likely negative regions. However, accurately distinguishing between sounding and non-sounding regions is challenging without manual annotations. In this work, we propose a simple yet effective approach for Easy Visual Sound Localization, namely EZ-VSL, without relying on the construction of positive and/or negative regions during training. Instead, we align audio and visual spaces by seeking audio-visual representations that are aligned in, at least, one location of the associated image, while not matching other images, at any location. We also introduce a novel object guided localization scheme at inference time…
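The alignment objective described in the abstract can be sketched in a few lines. What follows is a minimal illustration, not the authors' implementation: it assumes cosine similarity, a max over image locations to capture "aligned in, at least, one location", and a symmetric softmax contrastive loss against the other images in the batch. The function names, the temperature of 0.07, and the plain-list data layout are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def max_location_similarity(audio, visual_map):
    """Best similarity between one audio vector and any image location.

    visual_map is a list of per-location feature vectors (assumed layout).
    Taking the max means the audio only has to match ONE location.
    """
    return max(cosine(audio, location) for location in visual_map)


def log_softmax_at(row, index):
    """Numerically stable log-softmax of `row`, evaluated at `index`."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_z


def mil_contrastive_loss(audios, visual_maps, temperature=0.07):
    """Symmetric contrastive loss over a batch of (audio, image) pairs.

    Pair i is the positive; every other image (at any location) and every
    other audio clip in the batch acts as a negative. The temperature is
    an assumed value, not one taken from the paper.
    """
    n = len(audios)
    sim = [
        [max_location_similarity(audios[i], visual_maps[j]) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Audio -> image direction: each audio should pick its own image.
    a2v = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Image -> audio direction: each image should pick its own audio.
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    v2a = -sum(log_softmax_at(columns[j], j) for j in range(n)) / n
    return a2v + v2a
```

In this sketch no region is ever labeled as sounding or silent during training: the max simply selects whichever location agrees best with the audio, and the mismatched images in the batch supply the negatives. At inference, the per-location similarities (before the max) could be read off as a localization map.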

Code & Models

Repositories

stonemo/ez-vsl (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation