Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models

Ling Li; Yao Zhou; Yuxuan Liang; Fugee Tsung; Jiaheng Wei

arXiv:2506.14674 · cs.CV · October 27, 2025

TL;DR

This paper introduces GLOBE, an approach that uses large vision-language models and a new dataset of diverse social media images to treat image geo-localization as a reasoning task, improving both interpretability and accuracy.

Contribution

The paper presents MP16-Reason, a new reasoning-oriented dataset, and GLOBE, a training pipeline that significantly improves geo-localization performance and interpretability over existing methods.

Findings

01

GLOBE outperforms state-of-the-art LVLMs in geo-localization accuracy.

02

Compared with reasoning-focused datasets built on street-view imagery, the new MP16-Reason dataset offers greater scene diversity and viewpoint variation.

03

GLOBE provides more interpretable reasoning trajectories.

Abstract

Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE, Group-relative policy…
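The abstract and findings refer to geo-localization accuracy without stating, in this excerpt, how it is measured. As a minimal sketch, assuming the convention often used in this area (the fraction of predictions whose great-circle distance to the true location falls within a set of distance thresholds), the computation could look like the following. The function names, the threshold values, and the Earth radius constant are illustrative assumptions, not taken from the paper.

```python
import math

# Mean Earth radius in kilometres (assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at_thresholds(preds, truths, thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions within each distance threshold.

    preds and truths are equal-length sequences of (lat, lon) pairs in degrees.
    The default thresholds are an assumption for illustration only.
    """
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        raise ValueError("at least one prediction is required")
    dists = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(preds, truths)]
    return {t: sum(d <= t for d in dists) / len(dists) for t in thresholds_km}
```

A call such as `accuracy_at_thresholds([(48.8566, 2.3522)], [(51.5074, -0.1278)])` returns a dictionary mapping each threshold to the share of predictions that landed within it, which makes results at coarse and fine granularity directly comparable.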

Videos

Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models (SlidesLive)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques