Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives
Xingxing Weng, Chao Pang, Gui-Song Xia

TL;DR
This paper reviews recent advances in vision-language models for remote sensing, covering architectures, datasets, and future directions, and highlights how these models improve data analysis and user interaction in the domain.
Contribution
It provides a comprehensive taxonomy, analyzes existing models and datasets in detail, and discusses future research challenges in remote sensing vision-language modeling.
Findings
Vision-language models achieve strong performance across a variety of remote sensing data analysis tasks.
Contrastive learning, visual instruction tuning, and text-conditioned image generation are the key categories.
Large-scale datasets and architectural innovations drive progress.
Abstract
Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress. The resulting models benefit from the absorption of extensive general knowledge and demonstrate strong performance across a variety of remote sensing data analysis tasks. Moreover, they are capable of interacting with users in a conversational manner. In this paper, we aim to provide the remote sensing community with a timely and comprehensive review of the developments in VLM using the two-stage paradigm. Specifically, we first cover a taxonomy of VLM in remote sensing: contrastive learning, visual instruction tuning, and text-conditioned image generation. For each category, we detail the commonly used…
