Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim

TL;DR
This paper introduces an audio-visual sound source localization method that uses multimodal large language models to generate detailed contextual information, improving accuracy in complex scenes where visually similar silent objects appear alongside the sound-making object.
Contribution
The paper presents a novel framework leveraging MLLMs and two new loss functions to enhance localization accuracy and semantic differentiation in complex visual scenes.
Findings
Significantly outperforms existing methods on MUSIC and VGGSound datasets.
Effective in both single-source and multi-source localization scenarios.
Demonstrates robustness in scenes with visually similar silent objects.
Abstract
The audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle to accurately localize sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive…
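To make the limitation of "simple audio-visual correspondence" concrete, the sketch below shows a generic cosine-similarity localization map of the kind that correspondence-based baselines commonly compute. It is not the framework proposed in the paper. The function names, array shapes, and toy features are all assumptions made for illustration.

```python
import numpy as np

def correspondence_map(visual_feats, audio_emb, eps=1e-8):
    """Cosine similarity between one audio embedding and each spatial location.

    visual_feats: (H, W, D) per-location visual features (hypothetical input).
    audio_emb:    (D,) audio embedding (hypothetical input).
    Returns an (H, W) similarity map.
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    return v @ a

def localize(sim_map):
    """Return the (row, col) index of the highest-similarity location."""
    return np.unravel_index(np.argmax(sim_map), sim_map.shape)

# Toy scene: random features, with the audio embedding matching location (2, 1).
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))
audio = feats[2, 1].copy()

sim = correspondence_map(feats, audio)
row, col = localize(sim)

# Add a visually identical but silent object at (0, 3).
feats_similar = feats.copy()
feats_similar[0, 3] = feats[2, 1]
sim_similar = correspondence_map(feats_similar, audio)

# Both locations now receive the same score, so correspondence alone
# cannot tell the sounding object from its silent look-alike.
ambiguous = bool(np.isclose(sim_similar[0, 3], sim_similar[2, 1]))
```

In the toy scene the map peaks at the matching location, but once an identical-looking silent object is added, both locations score the same. This is the ambiguity the abstract attributes to correspondence-only methods and the one the additional contextual information is meant to resolve.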
