GeoLocSFT: Efficient Visual Geolocation via Supervised Fine-Tuning of Multimodal Foundation Models

Qiang Yi; Lianlei Shan

arXiv:2506.01277 · cs.AI · June 3, 2025

TL;DR

GeoLocSFT demonstrates that targeted supervised fine-tuning of a large multimodal model with a small, high-quality dataset can achieve state-of-the-art visual geolocation performance, especially in sparsely populated regions.

Contribution

The paper introduces GeoLocSFT, a novel framework that uses supervised fine-tuning of Gemma 3 with limited data to improve geolocation accuracy significantly.

Findings

01. Supervised fine-tuning with only 2700 image-GPS pairs improves geolocation performance.

02. GeoLocSFT outperforms baseline models on multiple benchmarks including Im2GPS-3k, YFCC-4k, and MR40k.

03. Core performance gains are achieved at the fine-tuning stage without complex inference strategies.

Abstract

Accurately determining the geographic location where a single image was taken, visual geolocation, remains a formidable challenge due to the planet's vastness and the deceptive similarity among distant locations. We introduce GeoLocSFT, a framework that demonstrates how targeted supervised fine-tuning (SFT) of a large multimodal foundation model (Gemma 3) using a small, high-quality dataset can yield highly competitive geolocation performance. GeoLocSFT is trained with only 2700 carefully selected image-GPS pairs from our geographically diverse MR600k dataset. Despite this limited data, our SFT-centric approach substantially improves over baseline models and achieves robust results on standard benchmarks such as Im2GPS-3k and YFCC-4k, as well as on our newly proposed and challenging MR40k benchmark, aimed specifically at sparsely populated regions. Further, we explore multi-candidate…
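The abstract cites results on Im2GPS-3k, YFCC-4k, and MR40k but this page does not say how they are scored. As a minimal sketch, assuming the convention common in the geolocation literature (the fraction of predictions within fixed great-circle distances of the true location), scoring could look like the following. The function names, the Earth radius constant, and the threshold values (1, 25, 200, 750, 2500 km) are assumptions made for illustration and are not taken from the paper.

```python
import math

# Mean Earth radius in kilometres (assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_at_thresholds(preds, truths,
                           thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions within each distance threshold.

    preds and truths are equal-length sequences of (lat, lon) pairs.
    The default thresholds are an assumption, not the paper's stated setup.
    """
    if len(preds) != len(truths) or len(preds) == 0:
        raise ValueError("preds and truths must be non-empty and equal length")
    dists = [haversine_km(p[0], p[1], t[0], t[1])
             for p, t in zip(preds, truths)]
    return {t: sum(d <= t for d in dists) / len(dists) for t in thresholds_km}
```

Called with a list of predicted coordinates and a list of ground-truth coordinates, `accuracy_at_thresholds` returns a dictionary mapping each threshold in kilometres to the share of images located within that distance.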

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications