Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives

Xingxing Weng; Chao Pang; Gui-Song Xia

arXiv:2505.14361 · cs.CV · June 11, 2025

TL;DR

This paper reviews recent advances in vision-language models for remote sensing, covering architectures, datasets, and future directions, highlighting their ability to improve data analysis and user interaction in the domain.

Contribution

It provides a comprehensive taxonomy, a detailed analysis of existing models and datasets, and a discussion of future research challenges in remote sensing vision-language modeling.

Findings

01

Vision-language models achieve strong performance across a variety of remote sensing data analysis tasks.

02

Contrastive learning, visual instruction tuning, and text-conditioned image generation are the key categories.

03

Large-scale datasets and architectural innovations drive progress.

Abstract

Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress. The resulting models benefit from the absorption of extensive general knowledge and demonstrate strong performance across a variety of remote sensing data analysis tasks. Moreover, they are capable of interacting with users in a conversational manner. In this paper, we aim to provide the remote sensing community with a timely and comprehensive review of the developments in VLM using the two-stage paradigm. Specifically, we first cover a taxonomy of VLM in remote sensing: contrastive learning, visual instruction tuning, and text-conditioned image generation. For each category, we detail the commonly used…
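To make the first category of the taxonomy concrete, here is a minimal sketch of a symmetric image-text contrastive loss of the kind commonly used for pre-training on image-text pairs. It is an illustration only, not the formulation of any specific model covered by the paper: the function names, the NumPy implementation, and the temperature value of 0.07 are assumptions chosen for the example.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each array
    is assumed to describe the same image-text pair.
    temperature: illustrative value, an assumption of this sketch.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities: logits[i, j] compares image i with text j.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should put its mass on the matching column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should put its mass on the matching row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))
    print("aligned pairs:   ", contrastive_loss(emb, emb))
    print("mismatched pairs:", contrastive_loss(emb, emb[::-1]))
```

Running the script compares a batch whose image and text embeddings are correctly paired with the same batch in reversed order; the correctly paired batch should yield the lower loss, which is the signal such pre-training optimizes.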
