SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning
Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu

TL;DR
SpotAgent is an agentic reasoning framework for visual geo-localization that pairs a large vision-language model with external tools (e.g., web search, maps) and a 3-stage post-training pipeline, reporting state-of-the-art accuracy and more verifiable predictions.
Contribution
The paper presents a novel agentic reasoning approach with a multi-stage training pipeline and dynamic filtering, improving geo-localization accuracy and verifiability over prior methods.
Findings
Achieves state-of-the-art performance on geo-localization benchmarks.
Effectively reduces hallucinations and improves result verifiability.
Enhances reasoning with external tools and a specialized training strategy.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization as an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct paradigm. We introduce a 3-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an…
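The tool-assisted verification loop described in the abstract can be illustrated with a minimal ReAct-style sketch (thought, action, observation, repeat). This is not the authors' implementation: the scripted policy stands in for the LVLM, and the tool names `web_search` and `map_lookup`, their stubbed outputs, and the place name are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    thought: str   # the model's reasoning for this turn
    action: str    # a tool name, or "answer" to stop
    argument: str  # tool input, or the final answer text


@dataclass
class Trace:
    steps: List[Tuple[Step, str]] = field(default_factory=list)  # (step, observation)
    answer: Optional[str] = None


def react_loop(
    policy: Callable[[List[Tuple[Step, str]]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Trace:
    """Alternate between policy decisions and tool calls until an answer or step limit."""
    trace = Trace()
    for _ in range(max_steps):
        step = policy(trace.steps)
        if step.action == "answer":
            trace.answer = step.argument
            break
        tool = tools.get(step.action)
        # Unknown tools produce an observation instead of raising, so the policy can recover.
        observation = tool(step.argument) if tool else f"unknown tool: {step.action}"
        trace.steps.append((step, observation))
    return trace


def scripted_policy(history: List[Tuple[Step, str]]) -> Step:
    """Hypothetical stand-in for the LVLM: a fixed three-turn script."""
    if len(history) == 0:
        return Step("A sign reads 'Cafe Aurora'; search for it.", "web_search", "Cafe Aurora")
    if len(history) == 1:
        return Step("Search suggests a town; confirm on a map.", "map_lookup", "Cafe Aurora, Exampleville")
    return Step("Map and search agree.", "answer", "Exampleville")


# Hypothetical tools with canned outputs; a real system would call external services here.
TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": lambda q: f"stub search result for '{q}': Exampleville",
    "map_lookup": lambda q: f"stub map hit for '{q}'",
}

trace = react_loop(scripted_policy, TOOLS)
```

Keeping every (step, observation) pair in the trace is what makes the final answer checkable after the fact: each claim can be traced back to the tool output that supported it.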
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
