Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis

TL;DR
This paper proposes a two-stage method to adapt CLIP for remote sensing imagery by aligning multiple modalities, significantly improving zero-shot classification and retrieval without extensive retraining or task-specific data.
Contribution
It introduces a novel two-stage approach combining robust fine-tuning and cross-modal alignment to extend CLIP's capabilities to remote sensing modalities.
Findings
Significant performance improvements on RS classification benchmarks.
Effective cross-modal retrieval without textual descriptions.
No need for task-specific parameters or training from scratch.
Abstract
Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation models. In this work, we focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across various image classification tasks and often rivals fully supervised baselines, despite not being explicitly trained for those tasks. Nevertheless, there are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. These domains do not only exhibit fundamentally different distributions compared to natural images, but also commonly rely on complementary modalities, beyond RGB, to derive meaningful insights. To this end, we propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP. Our two-stage procedure addresses the aforementioned distribution…
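As a rough illustration of why alignment with CLIP's embedding space matters, the sketch below shows CLIP-style zero-shot classification: once an encoder for an RS modality produces embeddings in the same space as CLIP's text embeddings, a label can be chosen by cosine similarity alone. This is not the authors' implementation. The function names, the temperature value, and the toy three-dimensional embeddings are all hypothetical and stand in for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, class_text_embs, temperature=0.01):
    """Pick the class whose text embedding is closest to the image embedding.

    image_emb:       embedding of one image (assumed already aligned with
                     the text embedding space).
    class_text_embs: dict mapping a class label to its text-prompt embedding.
    temperature:     hypothetical softmax temperature (an assumption here).
    Returns (best_label, {label: probability}).
    """
    img = l2_normalize(image_emb)
    labels = list(class_text_embs)
    # Cosine similarity between the image and each class prompt.
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(class_text_embs[label])))
        for label in labels
    ]
    # Numerically stable softmax over temperature-scaled similarities.
    logits = [s / temperature for s in sims]
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy, made-up embeddings standing in for real text and RS image encoders.
text_embs = {
    "forest": [0.9, 0.1, 0.0],
    "river": [0.0, 0.2, 0.9],
    "urban area": [0.1, 0.9, 0.1],
}
aligned_rs_emb = [0.1, 0.3, 1.0]

label, probs = zero_shot_classify(aligned_rs_emb, text_embs)
print(label)
```

The same similarity computation, applied between embeddings of two image modalities instead of an image and a text prompt, is one way to picture retrieval that does not depend on textual descriptions.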
Taxonomy
Topics: Geographic Information Systems Studies · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Focus · Contrastive Language-Image Pre-training
