# Learning from Silence and Noise for Visual Sound Source Localization

**Authors:** Xavier Juanola, Giovana Morais, Magdalena Fuentes, Gloria Haro

arXiv: 2508.21761 · 2025-09-01

## TL;DR

This paper introduces SSL-SaN, a self-supervised model for visual sound source localization trained with silence and noise. The training strategy improves performance on positive cases and makes the model more robust to negative audio (silence, noise, and offscreen sounds).

## Contribution

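The paper presents three contributions: a training strategy that incorporates silence and noise, a metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs, and IS3+, an extended version of the IS3 synthetic dataset that adds negative audio for evaluation.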

## Key findings

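- SSL-SaN achieves state-of-the-art performance among self-supervised models in both sound localization and cross-modal retrieval.
- The new metric quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs.
- IS3+ extends and improves the IS3 synthetic dataset with negative audio.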

## Abstract

Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics cover scenarios with a single visible sound source in the scene. To address these shortcomings, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available at https://xavijuanola.github.io/SSL-SaN/.
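To make the idea of negative audio concrete, the following is a minimal sketch of how silence and noise clips might be paired with video frames during training. It is not the authors' implementation: this summary does not describe their procedure, so the function names (`make_negative_audio`, `build_training_pairs`), the Gaussian noise model, the `noise_std` value, and the 0/1 labels are all assumptions made for illustration.

```python
import random


def make_negative_audio(num_samples, kind, rng, noise_std=0.1):
    """Return a mono waveform (list of floats in [-1, 1]) with no source-related content.

    Hypothetical helper: the noise model and its level are assumptions.
    """
    if kind == "silence":
        return [0.0] * num_samples
    if kind == "noise":
        # Gaussian noise, clipped to the valid waveform range.
        return [max(-1.0, min(1.0, rng.gauss(0.0, noise_std))) for _ in range(num_samples)]
    raise ValueError(f"unknown negative audio kind: {kind!r}")


def build_training_pairs(frames, waveforms, rng):
    """Pair each frame with its own audio (label 1) and with silence and noise (label 0)."""
    pairs = []
    for frame, wave in zip(frames, waveforms):
        pairs.append((frame, wave, 1))  # positive: audio that belongs to the frame
        for kind in ("silence", "noise"):
            negative = make_negative_audio(len(wave), kind, rng)
            pairs.append((frame, negative, 0))  # negative: nothing to localize
    return pairs


# Usage with placeholder data: one frame identifier and one second of audio at 16 kHz.
rng = random.Random(0)
frames = ["frame_000"]
waveforms = [[rng.uniform(-1.0, 1.0) for _ in range(16000)]]
pairs = build_training_pairs(frames, waveforms, rng)
```

In this sketch each frame yields one positive pair and two negative pairs. A training loss could then encourage high audio-visual similarity for label 1 and low similarity everywhere in the image for label 0. How SSL-SaN actually uses these signals is specified in the full paper, not here.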

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21761/full.md

## Figures

27 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21761/full.md

## References

69 references — full list in the complete paper: https://tomesphere.com/paper/2508.21761/full.md

---
Source: https://tomesphere.com/paper/2508.21761