TL;DR
This paper adapts Multimodal Large Language Models for natural-language guided geo-localization, achieving state-of-the-art results with a simple, parameter-efficient fine-tuning approach that outperforms traditional methods.
Contribution
It introduces a framework for fine-tuning Multimodal Large Language Models (MLLMs) for Natural-language Guided Cross-view Geo-localization (NGCG), enabling strong cross-modal alignment without complex architectural changes.
Findings
Achieved 12.2% improvement in Text-to-Image Recall@1 on GeoText-1652.
Secured top performance in 5 out of 12 subtasks on CVG-Text.
Surpassed baseline methods with fewer trainable parameters.
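For readers unfamiliar with the Recall@1 metric cited above, the following is a minimal sketch of how text-to-image Recall@K is commonly computed from embeddings. It is not the paper's evaluation code. The function names (`cosine`, `recall_at_k`), the use of cosine similarity, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(text_embs, image_embs, gt_index, k=1):
    """Fraction of text queries whose ground-truth image is in the top-k.

    text_embs:  list of query embeddings (one per text description)
    image_embs: list of gallery embeddings (one per candidate image)
    gt_index:   gt_index[i] is the gallery index matching query i
    """
    hits = 0
    for query, target in zip(text_embs, gt_index):
        scores = [cosine(query, img) for img in image_embs]
        # Rank gallery indices from most to least similar.
        ranked = sorted(range(len(image_embs)),
                        key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(text_embs)

# Toy data (hypothetical): three text queries, three gallery images.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
gt_index = [0, 1, 0]  # the third query's true match is image 0

r1 = recall_at_k(text_embs, image_embs, gt_index, k=1)
r3 = recall_at_k(text_embs, image_embs, gt_index, k=3)
print(f"Recall@1 = {r1:.3f}, Recall@3 = {r3:.3f}")
```

In this toy setup the third query ranks image 2 above its true match, so only two of the three queries count as hits at K=1. With K equal to the gallery size, every query counts as a hit.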
Abstract
Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter-efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we…
