Global Cross-Modal Geo-Localization: A Million-Scale Dataset and a Physical Consistency Learning Framework

Yutong Hu; Jinhui Chen; Chaoqiang Xu; Yuan Kou; Sili Zhou; Shaocheng Yan; Pengcheng Shi; Qingwu Hu; Jiayuan Li

arXiv:2603.08491·cs.CV·March 10, 2026

Open Access

TL;DR

This paper introduces CORE, a large-scale global dataset for cross-modal geo-localization, and proposes PLANET, a physical-law-aware network that leverages contrastive learning to improve localization accuracy across diverse environments.

Contribution

The paper presents CORE, the first million-scale, geographically diverse dataset for global cross-modal geo-localization, and PLANET, a novel physical-law-aware network designed to improve localization on it.

Findings

01

PLANET outperforms existing methods across diverse geographic regions.

02

The CORE dataset enables more robust and universal geo-localization models.

03

High-quality scene descriptions improve cross-modal matching accuracy.

Abstract

Cross-modal geo-localization (CMGL) matches ground-level text descriptions with geo-tagged aerial imagery, which is crucial for pedestrian navigation and emergency response. However, existing research is constrained by narrow geographic coverage and limited scene diversity, and so fails to reflect the immense spatial heterogeneity of global architectural styles and topographic features. To bridge this gap and facilitate universal positioning, we introduce CORE, the first million-scale dataset dedicated to global CMGL. CORE comprises 1,034,786 cross-view images sampled from 225 distinct geographic regions across all continents, offering an unprecedented variety of perspectives under varying environmental conditions and urban layouts. We leverage the zero-shot reasoning of Large Vision-Language Models (LVLMs) to synthesize high-quality scene descriptions rich in discriminative cues.…
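
The abstract frames CMGL as matching a text description to the correct aerial image, and the TL;DR states that PLANET relies on contrastive learning. As a rough illustration of that general idea only, the sketch below computes a generic symmetric InfoNCE-style loss over paired text and image embeddings and retrieves the best-matching image for a query. It is not PLANET's actual objective, which is not specified in this summary; the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(text_embs, image_embs, temperature=0.07):
    """Generic symmetric contrastive loss (hypothetical helper).

    text_embs[i] and image_embs[i] are assumed to describe the same location.
    The temperature is an assumed value, not one taken from the paper.
    """
    logits = [[cosine(t, v) / temperature for v in image_embs] for t in text_embs]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(logits_t))


def retrieve(text_emb, image_embs):
    """Index of the aerial-image embedding most similar to the text query."""
    scores = [cosine(text_emb, v) for v in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D embeddings standing in for encoder outputs (purely illustrative).
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]
shuffled_images = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_info_nce(texts, aligned_images)
loss_shuffled = symmetric_info_nce(texts, shuffled_images)
best_match = retrieve(texts[0], aligned_images)
```

In this toy setup the loss is lower when each text embedding sits close to its paired image embedding than when the pairs are swapped, which is the property a contrastive objective rewards. At inference time, localization reduces to a nearest-neighbor search over the geo-tagged image embeddings.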

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques