What's Making That Sound Right Now? Video-centric Audio-Visual Localization
Hahyeon Choi, Junhoo Lee, Nojun Kwak

TL;DR
This paper introduces AVATAR, a video-centric benchmark for audio-visual localization (AVL), and TAVLO, a model that uses temporal information to localize sound sources more accurately.
Contribution
The paper proposes AVATAR, a video-centric AVL benchmark covering four scenarios (Single-sound, Mixed-sound, Multi-entity, and Off-screen), and TAVLO, a novel model that explicitly incorporates high-resolution temporal information for better localization.
Findings
TAVLO outperforms conventional methods at temporal tracking; conventional methods struggle because they rely on global audio features and frame-level mappings.
Temporal dynamics are crucial for accurate audio-visual localization.
AVATAR enables more realistic evaluation of AVL models.
Abstract
Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In…
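To make the abstract's point about global audio features concrete, here is a minimal sketch of similarity-based localization on a toy clip. It is not TAVLO or any method from the paper: the function names (`cosine`, `localize`), the 1x2 patch grid, and the 2-d embeddings are all hypothetical, chosen only to show how a single clip-level audio vector can miss a source that sounds in just one frame, while per-frame audio vectors follow it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def localize(audio_vec, visual_map):
    """Return (row, col) of the visual patch most similar to audio_vec."""
    scored = (
        (cosine(audio_vec, patch), (r, c))
        for r, row in enumerate(visual_map)
        for c, patch in enumerate(row)
    )
    return max(scored)[1]

# Hypothetical toy clip: 3 frames, each a 1x2 grid of patch embeddings.
# Patch (0, 0) stands for a dog, patch (0, 1) for a guitar.
visual_frames = [[[[1.0, 0.0], [0.0, 1.0]]]] * 3

# Hypothetical per-frame audio embeddings: the dog barks in frame 0,
# the guitar plays in frames 1 and 2.
audio_frames = [[1.0, 0.1], [0.1, 1.0], [0.1, 1.0]]

# Global audio feature: one vector averaged over the whole clip.
global_audio = [sum(dim) / len(audio_frames) for dim in zip(*audio_frames)]

global_preds = [localize(global_audio, vm) for vm in visual_frames]
per_frame_preds = [localize(a, vm) for a, vm in zip(audio_frames, visual_frames)]

print("global audio:   ", global_preds)     # [(0, 1), (0, 1), (0, 1)]
print("per-frame audio:", per_frame_preds)  # [(0, 0), (0, 1), (0, 1)]
```

In this toy setup the clip-level average is dominated by the guitar, so the global variant points at the guitar patch in every frame and misses the bark in frame 0. The per-frame variant switches sources as the audio changes.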
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
