Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Sung Jin Um; Dongjin Kim; Sangmin Lee; Jung Uk Kim

arXiv:2506.18557 · cs.CV · June 25, 2025

TL;DR

This paper introduces a new audio-visual sound source localization method that uses multimodal large language models to generate detailed contextual information, improving accuracy in complex scenes with similar objects.

Contribution

The paper presents a novel framework leveraging MLLMs and two new loss functions to enhance localization accuracy and semantic differentiation in complex visual scenes.

Findings

01

Significantly outperforms existing methods on MUSIC and VGGSound datasets.

02

Effective in both single-source and multi-source localization scenarios.

03

Demonstrates robustness in scenes with visually similar silent objects.

Abstract

The audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle to accurately localize sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive…
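To make the limitation described in the abstract concrete, the following is a minimal sketch of a plain audio-visual correspondence map, the kind of baseline the abstract says existing methods rely on. It is not the paper's method or its loss functions. The function name `correspondence_map`, the feature shapes, and the toy data are all assumptions chosen for illustration: the map is simply the cosine similarity between one audio embedding and the visual feature at each spatial location.

```python
import numpy as np


def correspondence_map(visual_feats, audio_emb, eps=1e-8):
    """Cosine similarity between one audio embedding and every spatial location.

    visual_feats: array of shape (H, W, D), one D-dim feature per location.
    audio_emb:    array of shape (D,), a single clip-level audio embedding.
    Returns an (H, W) similarity map with values in [-1, 1].
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    # L2-normalize visual features along the channel axis and the audio vector.
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    # Dot product of unit vectors gives cosine similarity at each location.
    return v @ a


# Toy scene (assumed data): two visually identical objects, only one is sounding.
rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
feats = rng.normal(size=(H, W, D))
guitar = rng.normal(size=D)
feats[1, 1] = guitar   # the guitar that is actually being played
feats[2, 3] = guitar   # a second, silent guitar with the same appearance
audio = guitar.copy()  # audio embedding assumed to align with the guitar feature

heat = correspondence_map(feats, audio)
# Both guitar locations receive the same (maximal) score, so the map alone
# cannot tell the sounding object from the silent one.
print(heat[1, 1], heat[2, 3])
```

In this toy setup both locations score identically, which illustrates why correspondence alone cannot separate a sounding object from a visually similar silent one and why additional context is needed.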
