Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models
Ling Li, Yao Zhou, Yuxuan Liang, Fugee Tsung, Jiaheng Wei

TL;DR
This paper introduces GLOBE, an approach that uses large vision-language models and a new, diverse social media dataset to make image geo-localization reasoning-driven, more interpretable, and more accurate.
Contribution
It presents MP16-Reason, a new reasoning-oriented dataset, and GLOBE, a training pipeline that significantly improves geo-localization performance and interpretability over existing methods.
Findings
GLOBE outperforms state-of-the-art LVLMs in geo-localization accuracy.
The new dataset MP16-Reason enhances scene diversity and viewpoint variation.
GLOBE provides more interpretable reasoning trajectories.
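The summary above does not say how geo-localization accuracy is scored. A common convention in this literature is to report the fraction of images whose predicted coordinates fall within a set of great-circle distance thresholds of the ground truth. The sketch below illustrates that convention only; the function names and the threshold values (1, 25, 200, 750, 2500 km) are assumptions for illustration and are not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at_thresholds(preds, truths, thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions within each distance threshold of the ground truth.

    preds and truths are equal-length sequences of (lat, lon) pairs in degrees.
    The default thresholds are illustrative assumptions, not the paper's protocol.
    """
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and the same length")
    dists = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(preds, truths)]
    return {thr: sum(d <= thr for d in dists) / len(dists) for thr in thresholds_km}


if __name__ == "__main__":
    # Hypothetical predictions: one exact hit (Paris), one far miss for London.
    preds = [(48.8566, 2.3522), (0.0, 0.0)]
    truths = [(48.8566, 2.3522), (51.5074, -0.1278)]
    print(accuracy_at_thresholds(preds, truths))
```

Reporting several thresholds at once separates street-level precision from merely getting the right country or continent, which a single average distance would blur.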
Abstract
Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE, Group-relative policy…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
