Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing

Mathis Jürgen Adler; Leonard Hackel; Gencer Sumbul; Begüm Demir

arXiv:2505.11121 · cs.CV · May 19, 2025

Open Access

TL;DR

This paper introduces a weighted feature aggregation strategy for vision-language model pretraining in remote sensing, reducing redundancy from multiple captions and improving downstream text-to-image retrieval performance.

Contribution

The paper proposes a redundancy-aware pretraining method for remote sensing vision-language models that aggregates the features of multiple captions per image using adaptive importance weights.

Findings

01. Improved text-to-image retrieval accuracy in remote sensing tasks.

02. Effective reduction of redundant information from multiple captions.

03. Guidelines for selecting importance weighting techniques based on resource constraints.

Abstract

The development of foundation models through pretraining of vision-language models (VLMs) has recently attracted great attention in remote sensing (RS). VLM pretraining aims to learn image and language alignments from a large number of image-text pairs. Each pretraining image is often associated with multiple captions containing redundant information due to repeated or semantically similar phrases, resulting in increased pretraining and inference time. To overcome this, we introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in RS. Our strategy aims to extract and exploit complementary information from multiple captions per image while reducing redundancies through feature aggregation with importance weighting. To calculate adaptive importance weights for different captions of each image, we propose two techniques: (i) non-parametric uniqueness and (ii)…
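The aggregation idea in the abstract can be illustrated with a minimal sketch. The abstract is truncated and does not define either weighting technique, so everything below is an assumption rather than the authors' method: the names `uniqueness_scores`, `softmax`, and `aggregate` are hypothetical, the uniqueness score is a stand-in (one minus the mean cosine similarity to the other captions of the same image), and the softmax normalization with a `temperature` parameter is an illustrative choice.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def uniqueness_scores(caption_feats: List[Vector]) -> List[float]:
    """ASSUMED stand-in score: 1 - mean cosine similarity to the other captions.

    Captions that repeat what the others say get a low score.
    """
    if len(caption_feats) == 1:
        return [1.0]
    scores = []
    for i, feat in enumerate(caption_feats):
        sims = [cosine(feat, other)
                for j, other in enumerate(caption_feats) if j != i]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; turns scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(caption_feats: List[Vector], weights: List[float]) -> Vector:
    """Weighted sum of caption features: one text feature per image."""
    dim = len(caption_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, caption_feats))
            for d in range(dim)]


# Toy features: captions 0 and 1 are duplicates, caption 2 is distinct.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = softmax(uniqueness_scores(feats))
fused = aggregate(feats, weights)
```

In this toy case the distinct caption receives a larger weight than either duplicate, and the image is paired with a single fused text feature rather than three separate captions.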

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need