Localizing Visual Sounds the Easy Way
Shentong Mo, Pedro Morgado

TL;DR
This paper introduces EZ-VSL, a simple unsupervised method for visual sound localization that aligns audio-visual representations without positive/negative region construction, achieving state-of-the-art results.
Contribution
The work proposes a simple approach to unsupervised visual sound localization that does not require constructing positive or negative regions during training.
Findings
Achieves state-of-the-art performance on Flickr SoundNet and VGG-Sound Source datasets.
Improves CIoU from 76.80% to 83.94% on Flickr SoundNet (+7.14 percentage points).
Improves CIoU from 34.60% to 38.85% on VGG-Sound Source (+4.25 percentage points).
Abstract
Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarities for likely positive (sounding) regions and low similarities for likely negative regions. However, accurately distinguishing between sounding and non-sounding regions is challenging without manual annotations. In this work, we propose a simple yet effective approach for Easy Visual Sound Localization, namely EZ-VSL, without relying on the construction of positive and/or negative regions during training. Instead, we align audio and visual spaces by seeking audio-visual representations that are aligned in, at least, one location of the associated image, while not matching other images, at any location. We also introduce a novel object guided localization scheme at inference time…
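The alignment objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes cosine similarity between an audio embedding and every location of a visual feature map, max pooling over locations to capture "aligned in at least one location", and a symmetric contrastive loss over a batch. The function names, tensor shapes, and the temperature `tau` are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis, eps=1e-8):
    # Scale vectors to unit length along the given axis.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def max_pooled_similarity(audio, visual):
    """audio: (B, D); visual: (B, D, H, W).
    Returns S of shape (B, B), where S[i, j] is the highest cosine
    similarity between audio i and any spatial location of image j."""
    a = l2_normalize(audio, axis=1)
    v = l2_normalize(visual, axis=1)
    sim_maps = np.einsum("id,jdhw->ijhw", a, v)  # (B, B, H, W)
    return sim_maps.max(axis=(2, 3))

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def mil_contrastive_loss(audio, visual, tau=0.03):
    """Symmetric contrastive loss on max-pooled similarities.
    tau is an assumed temperature, not a value taken from the paper."""
    S = max_pooled_similarity(audio, visual) / tau
    idx = np.arange(S.shape[0])
    loss_a2v = -log_softmax(S, axis=1)[idx, idx].mean()  # audio -> image
    loss_v2a = -log_softmax(S, axis=0)[idx, idx].mean()  # image -> audio
    return loss_a2v + loss_v2a
```

Under these assumptions, the per-location similarity map for a matched pair (`sim_maps[i, i]` before pooling) could serve as a localization heatmap at inference time. No region is ever labeled positive or negative; only the best-matching location of each image contributes to the loss.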
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
