How to Listen? Rethinking Visual Sound Localization

Ho-Hsiang Wu; Magdalena Fuentes; Prem Seetharaman; Juan Pablo Bello

arXiv:2204.05156·cs.SD·April 12, 2022

TL;DR

This paper critically examines the main model components used for visual sound localization, analyzes their impact on performance across diverse datasets and challenging scenarios, and offers insights for improving real-world applications.

Contribution

It offers a comprehensive analysis of model choices and their effects on localization performance in complex environments, highlighting the importance of design decisions.

Findings

01 Model architecture and loss functions significantly influence localization accuracy.

02 Different datasets reveal varying sensitivities to model components.

03 Open-sourced code facilitates further research and application.

Abstract

Localizing visual sounds consists of locating the position of objects that emit sound within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and urban traffic. Previous works are usually evaluated on datasets with mostly a single dominant visible object, and proposed models usually require the introduction of localization modules during training or dedicated sampling strategies, but it remains unclear how these design choices affect the adaptability of these methods in more challenging scenarios. In this work, we analyze various model choices for visual sound localization and discuss how their different components affect the model's performance, namely the encoders' architecture, the loss function and the localization strategy. Furthermore, we study the interaction between these…
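To make the task concrete, the sketch below shows one common way a localization map can be computed: the cosine similarity between a single audio embedding and the visual feature vector at each spatial position, with the peak taken as the predicted sound source. This is an illustrative assumption, not the paper's confirmed method; the function names (`cosine`, `localization_map`, `argmax_location`) and the toy 2x2 feature grid are hypothetical, and real encoders would produce much larger embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def localization_map(audio_emb, visual_feats):
    """Score every spatial cell of an H x W grid of D-dim visual features
    against one D-dim audio embedding (hypothetical helper)."""
    return [[cosine(audio_emb, cell) for cell in row] for row in visual_feats]


def argmax_location(heatmap):
    """Return the (row, col) of the highest-scoring cell."""
    scored = (
        (value, (r, c))
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
    )
    return max(scored, key=lambda item: item[0])[1]


# Toy example: a 2 x 2 grid of 2-dimensional visual features (made-up values).
audio = [1.0, 0.0]
feats = [
    [[0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
]
heatmap = localization_map(audio, feats)
peak = argmax_location(heatmap)
```

In this toy grid the cell at row 1, column 0 is aligned with the audio embedding, so it receives the highest score. The design choices the abstract lists (encoder architecture, loss function, localization strategy) would change how the embeddings are produced and how the map is trained or post-processed, none of which this sketch attempts to model.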

Code & Models

Repositories

hohsiangwu/rethinking-visual-sound-localization
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies