Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment
Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

TL;DR
This paper introduces a new synthetic benchmark, evaluation metrics, and a cross-modal alignment learning framework to improve and thoroughly assess interactive sound source localization and cross-modal understanding.
Contribution
The paper presents a sound source localization approach built on enhanced cross-modal alignment, together with new benchmarks and evaluation metrics, to address the limitations of previous studies.
Findings
The proposed method outperforms existing approaches in localization accuracy.
New benchmarks reveal overlooked issues in current sound source localization methods.
Enhanced evaluation metrics provide a more comprehensive assessment of cross-modal interaction.
Abstract
Recent studies on learning-based sound source localization have mainly focused on the localization performance perspective. However, prior work and existing benchmarks overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events, such as silent objects or off-screen sounds. In this paper, we first comprehensively examine the cross-modal interaction of existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks. Then, we identify the limitations of previous studies and make several contributions to overcome the limitations. First, we introduce a new synthetic benchmark for interactive sound source localization. Second, we introduce new evaluation metrics to rigorously assess sound source localization…
Taxonomy
Topics: Hearing Loss and Rehabilitation · Music Technology and Sound Studies · Noise Effects and Management
