Multilingual Vision-Language Pre-training for the Remote Sensing Domain

João Daniel Silva; Joao Magalhaes; Devis Tuia; Bruno Martins

arXiv:2410.23370 · cs.CV · November 1, 2024

TL;DR

This paper introduces RS-M-CLIP, a multilingual vision-language model for remote sensing. It fine-tunes a multilingual CLIP model and tests a self-supervised method that aligns local and global representations, reporting state-of-the-art results in cross-modal retrieval and good zero-shot image classification performance.

Contribution

The paper presents a novel multilingual vision-language model for remote sensing that combines fine-tuning of a multilingual CLIP model with self-supervised alignment, and it uses translated datasets to improve performance across languages.

Findings

01. Translated data improves performance in multiple languages.

02. RS-M-CLIP achieves state-of-the-art results in cross-modal retrieval.

03. The model performs well in zero-shot image classification.

Abstract

Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific domain has relied on model fine-tuning with the standard contrastive objective, using existing human-labeled image-caption datasets, or using synthetic data corresponding to image-caption pairs derived from other annotations over remote sensing images (e.g., object classes). The use of different pre-training mechanisms has received less attention, and only a few exceptions have considered multilingual inputs. This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing the use of a self-supervised method based on aligning local and global representations from…
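The "standard contrastive objective" mentioned in the abstract can be illustrated with a minimal NumPy sketch of the symmetric image-text contrastive loss commonly used for CLIP-style training. This is not the authors' implementation: the function name `clip_contrastive_loss`, the temperature of 0.07, and the random toy embeddings are all assumptions made for illustration, and the paper's self-supervised local/global alignment term is not shown.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching
    pair; every other row in the batch serves as a negative.
    The temperature value is an illustrative assumption.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = img @ txt.T / temperature  # (batch, batch) scaled cosine similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits)[diag, diag].mean()    # image -> text direction
    loss_t2i = -log_softmax(logits.T)[diag, diag].mean()  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


# Toy data (hypothetical): image embeddings are noisy copies of caption embeddings.
rng = np.random.default_rng(0)
captions = rng.normal(size=(4, 8))
aligned_images = captions + 0.01 * rng.normal(size=(4, 8))
shuffled_images = aligned_images[[1, 2, 3, 0]]  # break the image-caption pairing

loss_aligned = clip_contrastive_loss(aligned_images, captions)
loss_shuffled = clip_contrastive_loss(shuffled_images, captions)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

Correctly paired batches should yield a lower loss than mispaired ones, which is the signal that fine-tuning exploits. In a multilingual setting, the text embeddings would come from captions in several languages paired with the same images.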

Code & Models

Repositories

DannielSilva/RS-M-CLIP
PyTorch, official

Taxonomy

Methods: Contrastive Language-Image Pre-training