VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing
Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg

TL;DR
VLM2GeoVec is a unified vision-language model designed for remote sensing that effectively combines retrieval and spatial reasoning, outperforming specialized models across various remote sensing tasks.
Contribution
The paper introduces VLM2GeoVec, a single-encoder model trained contrastively to embed diverse remote sensing inputs into a unified space, enabling versatile applications without task-specific modules.
Findings
Achieves 26.6% P@1 on region-caption retrieval, outperforming dual-encoder baselines.
Attains 32.5% P@1 on referring-expression retrieval, surpassing previous methods.
Reaches 17.8% P@1 on semantic geo-localization, more than three times the prior best.
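The P@1 figures above are retrieval scores: the fraction of queries whose top-ranked candidate is the correct one. The sketch below shows how such a score could be computed from embeddings. It assumes cosine similarity as the ranking function and uses toy 2-D vectors; the function names and data are hypothetical and are not taken from the paper's evaluation code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def precision_at_1(queries, candidates, targets):
    """Fraction of queries whose highest-scoring candidate is the target.

    queries:    list of query embeddings
    candidates: list of candidate embeddings (shared pool)
    targets:    targets[i] is the index of the correct candidate for query i
    """
    hits = 0
    for query, target in zip(queries, targets):
        scores = [cosine(query, cand) for cand in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        hits += int(best == target)
    return hits / len(queries)

# Toy data (illustrative only): the third query is deliberately a miss.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
targets = [0, 1, 2]

p_at_1 = precision_at_1(queries, candidates, targets)
print(f"P@1 = {p_at_1:.3f}")
```

Under this definition, a P@1 of 26.6% means the correct item is ranked first for roughly one query in four.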
Abstract
Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose VLM2GeoVec, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and…
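The abstract says the encoder is trained with a contrastive loss but does not give its exact form. As a rough illustration, the sketch below assumes an in-batch InfoNCE objective, a common choice for contrastive embedding models: each query embedding is pulled toward its paired target and pushed away from the other targets in the batch. The function name, the temperature value, and the toy vectors are assumptions, not details from the paper.

```python
import math

def info_nce(query_embs, target_embs, temperature=0.07):
    """In-batch InfoNCE loss (assumed form, not confirmed by the paper).

    target_embs[i] is the positive for query_embs[i]; every other target
    in the batch acts as a negative. Returns the mean loss over queries.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = [normalize(v) for v in query_embs]
    t = [normalize(v) for v in target_embs]

    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarities to every target in the batch.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the paired target as the correct class.
        total += log_denom - logits[i]
    return total / len(q)

# Toy batch: aligned pairs should give a much lower loss than swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned={aligned:.6f} swapped={swapped:.3f}")
```

In a model of the kind described here, the query and target vectors would both come from the same encoder applied to different interleaved inputs, for example an image with a bounding box on one side and a caption on the other.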
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
