TL;DR
This paper introduces a training-free sound source localization method that uses MLLM meta-reasoning to mimic human reasoning and improve localization in complex scenes.
Contribution
It proposes Generation-Analysis-Refinement (GAR), a framework that uses MLLMs for training-free sound source localization and adds the explicit reasoning and verification steps that feature-matching methods lack.
Findings
Achieves competitive results on single-source and multi-source benchmarks.
Demonstrates the effectiveness of reasoning-based, training-free localization.
Provides a new paradigm for sound source localization without training data.
Abstract
The sound source localization (SSL) task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching but lack explicit reasoning and verification, which limits their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks…
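The three GAR stages can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not specify how consistency is scored or how the gate is set, so the `Candidate` type, the voter functions, the `gate` threshold of 0.5, and the "keep the best-scoring boxes" refinement rule are all hypothetical stand-ins. The Generation stage (an MLLM producing boxes and an audio class) is replaced by hand-written inputs.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Candidate:
    """Hypothetical Generation output: one box plus its visual label."""
    box: Box
    label: str


# A voter is a hypothetical "anchor": it says whether a candidate
# is consistent with the predicted audio class.
Voter = Callable[[Candidate, str], bool]


def consistency(candidate: Candidate, audio_label: str,
                voters: Sequence[Voter]) -> float:
    """Analysis (assumed form): fraction of voters that agree.

    Assumes `voters` is non-empty.
    """
    votes = [voter(candidate, audio_label) for voter in voters]
    return sum(votes) / len(votes)


def gated_refine(candidates: Sequence[Candidate], audio_label: str,
                 voters: Sequence[Voter], gate: float = 0.5) -> List[Candidate]:
    """Refinement (assumed form): adjust only when some score is below the gate."""
    scores = [consistency(c, audio_label, voters) for c in candidates]
    if not scores or min(scores) >= gate:
        # Every candidate is already consistent: leave the boxes untouched.
        return list(candidates)
    best = max(scores)
    # Otherwise keep only the candidates with the highest consistency.
    return [c for c, s in zip(candidates, scores) if s == best]


def exact_match(c: Candidate, audio: str) -> bool:
    return c.label == audio


def substring_match(c: Candidate, audio: str) -> bool:
    return audio in c.label or c.label in audio


candidates = [Candidate((10, 10, 50, 50), "dog"),
              Candidate((60, 20, 90, 80), "sofa")]
result = gated_refine(candidates, "dog", [exact_match, substring_match])
print([c.label for c in result])
```

The point of the gate in this sketch is that refinement is skipped whenever every initial box already agrees with the audio class, which mirrors the abstract's stated goal of preventing unnecessary adjustments.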
