TL;DR
This paper introduces RS-M-CLIP, a multilingual vision-language model for remote sensing. It is built by fine-tuning a multilingual CLIP model with an additional self-supervised alignment objective, and it achieves state-of-the-art results in cross-modal retrieval while also performing well in zero-shot image classification.
Contribution
It presents a novel multilingual remote sensing model that combines contrastive fine-tuning with self-supervised alignment and uses translated datasets to improve performance across languages.
Findings
Translated data improves performance in multiple languages.
RS-M-CLIP achieves state-of-the-art results in cross-modal retrieval.
The model performs well in zero-shot image classification.
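Both evaluation settings named above reduce to ranking candidates by embedding similarity. The following is a minimal sketch of that idea, assuming image and text embeddings have already been produced by some CLIP-style encoder; the function names, class labels, and toy vectors are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query.

    For cross-modal retrieval, the query is a text (or image) embedding and the
    candidates are image (or text) embeddings.
    """
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text-prompt embedding is closest to the image embedding.

    class_text_embs maps a label to the embedding of a prompt for that label
    (hypothetical toy vectors here).
    """
    labels = list(class_text_embs)
    order = rank_by_similarity(image_emb, [class_text_embs[l] for l in labels])
    return labels[order[0]]

# Toy example with made-up 3-dimensional embeddings.
toy_classes = {"airport": [1.0, 0.0, 0.0], "forest": [0.0, 1.0, 0.0], "harbor": [0.0, 0.0, 1.0]}
toy_image = [0.1, 0.9, 0.2]
predicted = zero_shot_classify(toy_image, toy_classes)
```

In a multilingual setting, the class prompts or retrieval queries could be written in any language the text encoder supports, since only the resulting embeddings are compared.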
Abstract
Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific domain has relied on model fine-tuning with the standard contrastive objective, using existing human-labeled image-caption datasets, or using synthetic data corresponding to image-caption pairs derived from other annotations over remote sensing images (e.g., object classes). The use of different pre-training mechanisms has received less attention, and only a few exceptions have considered multilingual inputs. This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model and testing the use of a self-supervised method based on aligning local and global representations from…
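The "standard contrastive objective" mentioned in the abstract is commonly formulated as a symmetric cross-entropy over image-text similarity scores within a batch. The sketch below illustrates that general formulation in plain Python; the temperature value of 0.07 and all function names are assumptions for illustration, and the code does not reproduce the paper's training setup or its self-supervised component.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    image_embs[k] and text_embs[k] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: correctly paired embeddings versus the same texts swapped.
aligned_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image is most similar to its own caption and grows when captions are mismatched, which is the signal that fine-tuning exploits.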
Taxonomy
Methods: Contrastive Language-Image Pre-training
