GeoLocSFT: Efficient Visual Geolocation via Supervised Fine-Tuning of Multimodal Foundation Models
Qiang Yi, Lianlei Shan

TL;DR
GeoLocSFT demonstrates that targeted supervised fine-tuning of a large multimodal model with a small, high-quality dataset can achieve state-of-the-art visual geolocation performance, especially in sparsely populated regions.
Contribution
The paper introduces GeoLocSFT, a novel framework that uses supervised fine-tuning of Gemma 3 with limited data to improve geolocation accuracy significantly.
Findings
Supervised fine-tuning of Gemma 3 with only 2700 carefully selected image-GPS pairs substantially improves geolocation performance over baseline models.
GeoLocSFT outperforms baseline models on multiple benchmarks including Im2GPS-3k, YFCC-4k, and MR40k.
Core performance gains are achieved at the fine-tuning stage without complex inference strategies.
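For readers unfamiliar with how such benchmark results are usually scored, the following is a minimal sketch, not the authors' evaluation code. It assumes the common convention of measuring the great-circle (haversine) distance between predicted and ground-truth coordinates and reporting the fraction of images located within fixed distance thresholds. The function names and the threshold values (1, 25, 200, 750, 2500 km) are illustrative assumptions and do not come from the summary above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def accuracy_at_thresholds(preds, truths, thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions whose error is within each threshold.

    preds and truths are equal-length sequences of (lat, lon) tuples.
    The default thresholds are an assumed convention, not taken from the paper.
    """
    errors = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(preds, truths)]
    return {th: sum(e <= th for e in errors) / len(errors) for th in thresholds_km}
```

Under this assumed metric, a model that places half of its test images within 25 km of the true location would score 0.5 at the 25 km threshold.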
Abstract
Accurately determining the geographic location where a single image was taken, visual geolocation, remains a formidable challenge due to the planet's vastness and the deceptive similarity among distant locations. We introduce GeoLocSFT, a framework that demonstrates how targeted supervised fine-tuning (SFT) of a large multimodal foundation model (Gemma 3) using a small, high-quality dataset can yield highly competitive geolocation performance. GeoLocSFT is trained with only 2700 carefully selected image-GPS pairs from our geographically diverse MR600k dataset. Despite this limited data, our SFT-centric approach substantially improves over baseline models and achieves robust results on standard benchmarks such as Im2GPS-3k and YFCC-4k, as well as on our newly proposed and challenging MR40k benchmark, aimed specifically at sparsely populated regions. Further, we explore multi-candidate…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
