Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach
Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang

TL;DR
This paper introduces Geo-R, a reinforcement learning framework for image geolocalization that uses hierarchical geographic reasoning without relying on synthetic annotations or external retrieval, improving accuracy and interpretability.
Contribution
Geo-R is a novel retrieval-free, reinforcement learning-based approach that leverages structured geographic reasoning for scalable and interpretable image geolocalization.
Findings
Reports state-of-the-art accuracy on multiple geolocalization benchmarks.
Generalizes well across diverse datasets.
Provides transparent reasoning paths for its localization decisions.
Abstract
Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the…
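To make the idea of a coordinate-aligned reward concrete, the following is a minimal sketch of a Haversine-based reward. The Haversine formula itself is standard. The reward shaping is an assumption made for illustration: the function name distance_reward, the exponential decay, and the scale_km parameter are hypothetical, and the abstract does not specify the reward used in Geo-R.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 so floating-point rounding never pushes asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred, truth, scale_km=750.0):
    """Hypothetical reward in (0, 1]: equals 1 at zero error and decays with distance.

    The exponential form and the default scale_km are illustrative assumptions,
    not the reward defined in the paper.
    """
    d = haversine_km(pred[0], pred[1], truth[0], truth[1])
    return math.exp(-d / scale_km)


if __name__ == "__main__":
    paris = (48.8566, 2.3522)
    london = (51.5074, -0.1278)
    print(f"Paris to London: {haversine_km(*paris, *london):.1f} km")
    print(f"Reward for predicting London when truth is Paris: "
          f"{distance_reward(london, paris):.3f}")
```

A smooth, distance-based signal of this kind gives partial credit to predictions that land near the true location, which is one plausible reason to prefer it over an exact-match reward on region names alone.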
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
