Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing
Mathis Jürgen Adler, Leonard Hackel, Gencer Sumbul, Begüm Demir

TL;DR
This paper introduces a weighted feature aggregation strategy for vision-language model pretraining in remote sensing, reducing redundancy from multiple captions and improving downstream text-to-image retrieval performance.
Contribution
The paper proposes a redundancy-aware pretraining method that weights the features of each caption by importance before aggregating them, with the aim of improving remote sensing vision-language models.
Findings
Improved text-to-image retrieval accuracy in remote sensing tasks.
Effective reduction of redundant information from multiple captions.
Guidelines for selecting importance weighting techniques based on resource constraints.
Abstract
The development of foundation models through pretraining of vision-language models (VLMs) has recently attracted great attention in remote sensing (RS). VLM pretraining aims to learn image and language alignments from a large number of image-text pairs. Each pretraining image is often associated with multiple captions containing redundant information due to repeated or semantically similar phrases, resulting in increased pretraining and inference time. To overcome this, we introduce a weighted feature aggregation (WFA) strategy for VLM pretraining in RS. Our strategy aims to extract and exploit complementary information from multiple captions per image while reducing redundancies through feature aggregation with importance weighting. To calculate adaptive importance weights for different captions of each image, we propose two techniques: (i) non-parametric uniqueness and (ii)…
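The aggregation idea in the abstract can be illustrated with a minimal sketch. The abstract is truncated here, so the paper's exact weighting formulas are not available; the code below is an assumption-laden illustration, not the authors' implementation. It assumes each caption has already been encoded into a nonzero feature vector, scores a caption's "uniqueness" as one minus its mean cosine similarity to the other captions of the same image (a hypothetical stand-in for the non-parametric technique), and turns those scores into weights with a softmax. The names `uniqueness_weights`, `aggregate_captions`, and `temperature` are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def uniqueness_weights(caption_feats, temperature=1.0):
    """Assumed scoring rule: captions that are less similar to the
    other captions of the same image receive a larger weight.

    caption_feats: array of shape (num_captions, dim), rows nonzero.
    """
    n = caption_feats.shape[0]
    if n == 1:
        return np.ones(1)
    # L2-normalize so the dot product is cosine similarity.
    norms = np.linalg.norm(caption_feats, axis=1, keepdims=True)
    normed = caption_feats / norms
    sim = normed @ normed.T
    # Mean similarity to the other captions (drop the diagonal self-term).
    mean_sim = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)
    uniqueness = 1.0 - mean_sim
    return softmax(uniqueness / temperature)


def aggregate_captions(caption_feats, temperature=1.0):
    """Weighted sum of caption features: one text feature per image."""
    weights = uniqueness_weights(caption_feats, temperature)
    return weights @ caption_feats


# Two identical captions and one distinct caption for the same image.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
weights = uniqueness_weights(feats)
aggregated = aggregate_captions(feats)
```

In this toy example the two identical captions share the same, smaller weight, and the distinct caption receives the largest one, so the aggregated feature is not dominated by the repeated phrase. Aggregating to a single text feature per image also means one image-text pair is processed per image rather than one per caption.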
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need
