Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

Angelos Zavras; Dimitrios Michail; Begüm Demir; Ioannis Papoutsis

arXiv:2402.09816 · cs.CV · July 21, 2025 · 1 citation

TL;DR

This paper proposes a two-stage method to adapt CLIP for remote sensing imagery by aligning multiple modalities, significantly improving zero-shot classification and retrieval without extensive retraining or task-specific data.

Contribution

It introduces a novel two-stage approach combining robust fine-tuning and cross-modal alignment to extend CLIP's capabilities to remote sensing modalities.

Findings

1. Significant performance improvements on RS classification benchmarks.
2. Effective cross-modal retrieval without textual descriptions.
3. No need for task-specific parameters or training from scratch.

Abstract

Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation models. In this work, we focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across various image classification tasks and often rivals fully supervised baselines, despite not being explicitly trained for those tasks. Nevertheless, there are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. These domains not only exhibit fundamentally different distributions compared to natural images, but also commonly rely on complementary modalities, beyond RGB, to derive meaningful insights. To this end, we propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP. Our two-stage procedure addresses the aforementioned distribution…
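
For readers unfamiliar with the zero-shot setting the abstract refers to, the following is a minimal sketch of CLIP-style zero-shot classification: an image embedding is compared against one text embedding per class name by cosine similarity, and the most similar class wins. It does not reproduce the paper's two-stage procedure. The embeddings are made-up toy vectors rather than outputs of any real encoder, the function names (`l2_normalize`, `softmax`, `zero_shot_classify`) are hypothetical, and the `temperature` value of 100 is an assumed logit scale.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, class_text_embs, temperature=100.0):
    """Return (best_class_index, class_probabilities).

    image_emb: embedding of one image (toy values here, an assumption).
    class_text_embs: one embedding per class prompt (toy values here).
    temperature: assumed logit scale applied to the cosine similarities.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in class_text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(temperature * cosine)
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy example: three hypothetical land-cover classes in a 3-d embedding space.
class_names = ["forest", "river", "residential area"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]  # closest in direction to the "river" embedding

best, probs = zero_shot_classify(image_emb, text_embs)
print(class_names[best])  # prints "river"
```

No class-specific parameters are trained in this setup: adding a class only requires embedding a new text prompt. The same similarity score can also rank a gallery of embeddings for retrieval, provided all modalities share one embedding space.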

Taxonomy

Topics: Geographic Information Systems Studies · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Focus · Contrastive Language-Image Pre-training