Towards Vision-Language Geo-Foundation Model: A Survey
Yue Zhou, Zhihang Zhong, Xue Yang

TL;DR
This survey reviews the development of Vision-Language Geo-Foundation Models (VLGFMs), covering how they integrate geospatial data, their core technologies, their applications, and future research directions.
Contribution
It is the first comprehensive review of VLGFMs. It systematically summarizes recent advances and core methodologies, and it discusses future challenges in geospatial multimodal modeling.
Findings
VLGFMs leverage large-scale geospatial multimodal data.
Core technologies include specialized data construction and model architectures.
VLGFMs show promising applications in earth observation tasks.
Abstract
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation. Numerous geospatial image-text pair datasets and VLFMs fine-tuned on them have been proposed recently. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research…
Taxonomy
Topics: Semantic Web and Ontologies
