Skill-Conditioned Visual Geolocation for Vision-Language Models
Chenjie Yang, Yutian Jiang, Yutong Deng, Chenyu Wu

TL;DR
GeoSkill introduces an autonomous, evolving framework for vision-language geolocation that improves reasoning, adapts to new data, and enhances geographic knowledge without parameter updates.
Contribution
The paper presents a training-free, self-evolving Skill-Graph framework that refines geographic reasoning in vision-language models through autonomous skill synthesis and pruning.
Findings
Achieves promising geolocation accuracy and reasoning faithfulness.
Demonstrates superior generalization across diverse datasets.
Fosters emergence of verifiable, novel geographic skills.
Abstract
Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple…
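As a rough illustration of the Skill-Graph idea, the following Python sketch stores atomic natural-language skills as graph nodes, records whether each skill led to a correct prediction, prunes skills that underperform, and renders the surviving skills as prompt text for an inference model. It is not the authors' implementation: the class and method names (`Skill`, `SkillGraph`, `add_skill`, `record`, `prune`, `to_prompt`), the success-rate pruning rule with its `min_uses` and `min_rate` thresholds, and the example skills are all assumptions made for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Skill:
    """One atomic, natural-language geolocation skill (hypothetical schema)."""
    name: str
    description: str
    parents: set[str] = field(default_factory=set)  # prerequisite skills
    successes: int = 0
    failures: int = 0

    def uses(self) -> int:
        return self.successes + self.failures

    def success_rate(self) -> float:
        return self.successes / self.uses() if self.uses() else 0.0


class SkillGraph:
    """Minimal skill graph: add skills, record outcomes, prune weak skills."""

    def __init__(self) -> None:
        self.skills: dict[str, Skill] = {}

    def add_skill(self, name: str, description: str, parents=()) -> None:
        # Every parent must already exist so edges never dangle.
        missing = [p for p in parents if p not in self.skills]
        if missing:
            raise KeyError(f"unknown parent skills: {missing}")
        self.skills[name] = Skill(name, description, set(parents))

    def record(self, name: str, correct: bool) -> None:
        # Feedback loop: note whether using this skill led to a correct answer.
        skill = self.skills[name]
        if correct:
            skill.successes += 1
        else:
            skill.failures += 1

    def prune(self, min_uses: int = 3, min_rate: float = 0.5) -> list[str]:
        # Assumed rule: drop skills tried at least min_uses times whose
        # success rate is below min_rate, then detach them from children.
        removed = [
            name for name, s in self.skills.items()
            if s.uses() >= min_uses and s.success_rate() < min_rate
        ]
        for name in removed:
            del self.skills[name]
        for s in self.skills.values():
            s.parents -= set(removed)
        return removed

    def to_prompt(self) -> str:
        # Render the current skills as plain text for the inference model.
        lines = [f"- {s.name}: {s.description}" for s in self.skills.values()]
        return "Skills:\n" + "\n".join(lines)
```

In this sketch the graph lives entirely outside the model, so adding or pruning a skill changes only the prompt text and leaves model parameters untouched, which is one way to read the "training-free" claim. A skill-synthesis step would call `add_skill` with text proposed by a larger model; that step is omitted here because the abstract is truncated before it is fully described.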
