Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

Arda Senocak; Hyeonggon Ryu; Junsik Kim; Tae-Hyun Oh; Hanspeter Pfister; Joon Son Chung

arXiv:2407.13676 · cs.MM · July 19, 2024



TL;DR

This paper introduces a new synthetic benchmark, evaluation metrics, and a cross-modal alignment learning framework to improve and thoroughly assess interactive sound source localization and cross-modal understanding.

Contribution

It presents a cross-modal alignment learning approach, a new synthetic benchmark, and new evaluation metrics, addressing the limited treatment of cross-modal interaction in previous sound source localization studies.

Findings

01

The proposed method outperforms existing approaches in localization accuracy.

02

New benchmarks reveal overlooked issues in current sound source localization methods.

03

Enhanced evaluation metrics provide a more comprehensive assessment of cross-modal interaction.

Abstract

Recent studies on learning-based sound source localization have mainly focused on the localization performance perspective. However, prior work and existing benchmarks overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events, such as silent objects or off-screen sounds. In this paper, we first comprehensively examine the cross-modal interaction of existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks. Then, we identify the limitations of previous studies and make several contributions to overcome the limitations. First, we introduce a new synthetic benchmark for interactive sound source localization. Second, we introduce new evaluation metrics to rigorously assess sound source localization…

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

kaistmm/SSLalignment
PyTorch · Official

Videos

No videos yet.

Taxonomy

Topics: Hearing Loss and Rehabilitation · Music Technology and Sound Studies · Noise Effects and Management