Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges
Liling Yang, Ning Chen, Jun Yue, Yidan Liu, Jiayi Ma, Pedram Ghamisi, Antonio Plaza, Leyuan Fang

TL;DR
This survey reviews multimodal geospatial foundation models (GFMs) for remote sensing, covering their techniques, applications, and open challenges, and evaluates their performance on real-world tasks.
Contribution
The survey provides a systematic overview of multimodal GFMs, analyzing their architectures, techniques, benchmarks, and practical applications in remote sensing.
Findings
Multimodal GFMs effectively address modality heterogeneity and semantic gaps.
Benchmark evaluations reveal strengths and limitations of current models.
Case studies demonstrate GFMs' practical impact in land cover, agriculture, and disaster response.
Abstract
Foundation models have transformed natural language processing and computer vision, and their impact is now reshaping remote sensing image analysis. With powerful generalization and transfer learning capabilities, they align naturally with the multimodal, multi-resolution, and multi-temporal characteristics of remote sensing data. To address unique challenges in the field, multimodal geospatial foundation models (GFMs) have emerged as a dedicated research frontier. This survey delivers a comprehensive review of multimodal GFMs from a modality-driven perspective, covering five core visual and vision-language modalities. We examine how differences in imaging physics and data representation shape interaction design, and we analyze key techniques for alignment, integration, and knowledge transfer to tackle modality heterogeneity, distribution shifts, and semantic gaps. Advances in training…
