VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

Emanuel Sánchez Aimar; Gulnaz Zhambulova; Fahad Shahbaz Khan; Yonghao Xu; Michael Felsberg

arXiv:2512.11490·cs.CV·December 15, 2025

TL;DR

VLM2GeoVec is a unified vision-language model designed for remote sensing that effectively combines retrieval and spatial reasoning, outperforming specialized models across various remote sensing tasks.

Contribution

The paper introduces VLM2GeoVec, a single-encoder model trained contrastively to embed diverse remote sensing inputs into a unified space, enabling versatile applications without task-specific modules.

Findings

01

Achieves 26.6% P@1 on region-caption retrieval, outperforming dual-encoder baselines.

02

Attains 32.5% P@1 on referring-expression retrieval, surpassing previous methods.

03

Reaches 17.8% P@1 on semantic geo-localization, more than three times the prior best.
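
The findings above are all reported as P@1 (precision at 1): the fraction of queries whose top-ranked candidate is the correct one. The sketch below shows one way to compute it over a shared embedding space. It assumes cosine similarity as the ranking score and one gold candidate per query; the function names and toy vectors are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def precision_at_1(queries, candidates, gold):
    """Fraction of queries whose highest-scoring candidate is the gold one.

    queries, candidates: lists of embedding vectors (lists of floats).
    gold: gold[i] is the index into candidates that is correct for queries[i].
    """
    hits = 0
    for query, gold_index in zip(queries, gold):
        scores = [cosine(query, c) for c in candidates]
        top = max(range(len(candidates)), key=scores.__getitem__)
        hits += int(top == gold_index)
    return hits / len(queries)

# Toy 2-D embeddings (illustrative only): the first two queries retrieve
# their gold candidate, the third retrieves candidate 0 instead of 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
candidates = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
gold = [0, 1, 2]
p_at_1 = precision_at_1(queries, candidates, gold)
```

Here two of the three toy queries are answered correctly, so `p_at_1` is 2/3. A real evaluation would rank against the full candidate pool of the benchmark rather than three vectors.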

Abstract

Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose VLM2GeoVec, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and…
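
To make "trained with a contrastive loss" concrete, here is a minimal sketch of an in-batch InfoNCE objective over query and target embeddings. The abstract does not state which contrastive loss is used, so InfoNCE, the temperature value, and every name below are assumptions for illustration, not the authors' implementation. In this setting both sides would come from the same encoder.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(query_embs, target_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch (assumed loss form, assumed temperature).

    query_embs[i] and target_embs[i] form a matched pair; every other
    target in the batch serves as a negative for query i.
    """
    q = [normalize(v) for v in query_embs]
    t = [normalize(v) for v in target_embs]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity to every target in the batch.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matched target.
        total += log_denom - logits[i]
    return total / len(q)

# Toy batch: matched pairs give a near-zero loss, swapped pairs a large one.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each query toward its paired target and pushes it away from the other targets in the batch, which is what lets a single embedding space serve retrieval across input types.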

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques