What's Making That Sound Right Now? Video-centric Audio-Visual Localization

Hahyeon Choi; Junhoo Lee; Nojun Kwak

arXiv:2507.04667·cs.CV·July 9, 2025

TL;DR

This paper introduces AVATAR, a video-centric benchmark for audio-visual localization (AVL) that covers four scenarios, and TAVLO, a model that explicitly integrates temporal information to localize sound sources.

Contribution

The paper proposes AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information, and TAVLO, a video-centric AVL model that explicitly integrates temporal information.

Findings

1. TAVLO outperforms conventional methods in temporal tracking.
2. Temporal dynamics are crucial for accurate audio-visual localization.
3. AVATAR enables more realistic evaluation of AVL models.

Abstract

Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In…
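The abstract's point about global audio features can be illustrated with a toy sketch. This is not TAVLO or any method from the paper; it is a minimal illustration that assumes localization is done by cosine similarity between per-patch visual features and an audio embedding. All function names, shapes, and the two-frame toy data are hypothetical.

```python
import numpy as np

def cosine_heatmaps(visual, audio):
    """visual: (T, H, W, D) patch features; audio: (T, D), one vector per frame.
    Returns (T, H, W) cosine-similarity maps."""
    v = visual / np.linalg.norm(visual, axis=-1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=-1, keepdims=True)
    return np.einsum("thwd,td->thw", v, a)

def global_audio_heatmaps(visual, audio):
    # Assumed "global" baseline: pool audio over the whole clip and
    # reuse that single vector for every frame.
    pooled = audio.mean(axis=0, keepdims=True)
    return cosine_heatmaps(visual, np.repeat(pooled, visual.shape[0], axis=0))

def temporal_audio_heatmaps(visual, audio):
    # Assumed temporal variant: each frame is matched against the audio
    # embedding of its own time step.
    return cosine_heatmaps(visual, audio)

# Toy clip: 2 frames, a 1x2 patch grid, 2-dim features.
# The left patch looks like [1, 0], the right patch like [0, 1], in both frames.
visual = np.array([[[[1.0, 0.0], [0.0, 1.0]]],
                   [[[1.0, 0.0], [0.0, 1.0]]]])
# The left source sounds in frame 0, the right source in frame 1.
audio = np.array([[1.0, 0.0],
                  [0.0, 1.0]])

global_maps = global_audio_heatmaps(visual, audio)
temporal_maps = temporal_audio_heatmaps(visual, audio)

# With pooled audio every patch gets the same score in every frame,
# so the map cannot say which source is active at a given time.
print(global_maps.reshape(2, -1))
# With per-frame audio the peak moves from the left patch to the right patch.
print(temporal_maps.reshape(2, -1).argmax(axis=1))
```

In this toy setting the pooled audio vector is equally similar to both patches, which is one way to picture why clip-level audio features cannot follow a change in the active source over time.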

Code & Models

Datasets

mipal/AVATAR (dataset, 51 downloads)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation