Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

Subin Park; Jung Uk Kim

arXiv:2604.06824 · cs.CV · April 9, 2026

TL;DR

This paper introduces a training-free sound source localization method that uses MLLM meta-reasoning, mimicking human reasoning to improve localization in complex acoustic scenes.

Contribution

The paper proposes a Generation-Analysis-Refinement (GAR) framework that uses MLLMs for training-free sound source localization, adding the explicit reasoning and verification steps that contrastive feature-matching methods lack.

Findings

01. Achieves competitive results on single-source and multi-source benchmarks.

02. Demonstrates the effectiveness of reasoning-based, training-free localization.

03. Provides a new paradigm for sound source localization without training data.

Abstract

The sound source localization (SSL) task aims to identify the locations of sound-emitting objects by leveraging correlations between the audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching but lack explicit reasoning and verification, which limits their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks…
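
The three stages named in the abstract can be read as a simple control flow. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `gar_localize`, `generate`, `analyze`, `refine`, and `gate_threshold` are hypothetical, the MLLM calls are abstracted as plain callables, and a fixed threshold stands in for the adaptive gate, whose actual rule the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Candidate:
    """One initial guess from the Generation stage (hypothetical structure)."""
    box: Box
    audio_label: str


def gar_localize(
    generate: Callable[[], List[Candidate]],
    analyze: Callable[[Candidate], float],
    refine: Callable[[Candidate], Candidate],
    gate_threshold: float = 0.5,
) -> List[Candidate]:
    """Run a Generation -> Analysis -> Refinement loop over candidates.

    Assumption: `analyze` returns an audio-visual consistency score in
    [0, 1], and a candidate is refined only when its score falls below
    `gate_threshold`. The fixed threshold is a stand-in for the paper's
    adaptive gating.
    """
    results: List[Candidate] = []
    for candidate in generate():               # Generation
        consistency = analyze(candidate)       # Analysis
        if consistency >= gate_threshold:
            results.append(candidate)          # gate closed: keep as-is
        else:
            results.append(refine(candidate))  # Refinement
    return results


# Usage with stub stages in place of real MLLM calls.
initial = [Candidate((0, 0, 10, 10), "dog"), Candidate((5, 5, 20, 20), "car")]
scores = {"dog": 0.9, "car": 0.2}
final = gar_localize(
    generate=lambda: initial,
    analyze=lambda c: scores[c.audio_label],
    refine=lambda c: Candidate((6, 6, 18, 18), c.audio_label),
)
```

In this stub run the high-consistency "dog" box passes through unchanged, while the low-consistency "car" box is sent to the refinement callable, which is the behavior the gate is meant to illustrate.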

Code & Models

Repositories

VisualAIKHU/GAR-SSL (GitHub)