DiffCLIP: Few-shot Language-driven Multimodal Classifier
Jiaqing Zhang, Mingxiang Cao, Xue Yang, Kai Jiang, Yunsong Li

TL;DR
DiffCLIP is a novel few-shot learning framework that extends CLIP for accurate classification of high-dimensional multimodal remote sensing images using minimal labeled data and unsupervised learning techniques.
Contribution
It introduces a new framework combining unsupervised mask diffusion learning and modality-shared encoders to enhance CLIP's performance in specialized remote sensing domains.
Findings
Achieves a 10.65% accuracy improvement over CLIP on remote sensing datasets.
Uses only 2-shot image-text pairs for training.
Demonstrates strong performance in few-shot multimodal remote sensing classification.
Abstract
Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized domains such as remote sensing due to the limited availability of image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a novel framework that extends CLIP to effectively convey comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP is a few-shot learning method that leverages unlabeled images for pretraining. It employs unsupervised mask diffusion learning to capture the distribution of diverse modalities without requiring labels. The modality-shared image encoder maps multimodal data into a unified subspace, extracting shared features with…
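As a rough illustration of the language-driven classification idea the abstract builds on, the sketch below scores an image embedding against one text embedding per class using cosine similarity and a softmax, in the style of CLIP. It is not DiffCLIP's implementation: the embeddings are made-up toy vectors, and the `classify` helper, the class names, and the temperature value of 0.07 are assumptions chosen for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (the zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def classify(image_feat, text_feats, temperature=0.07):
    """Return (predicted_index, probabilities) for one image embedding.

    Each class is represented by a text embedding; the score is the cosine
    similarity between image and text, divided by a temperature (assumed value).
    """
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        txt = l2_normalize(text_feat)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)

    # Numerically stable softmax over the class logits.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs


# Toy embeddings (hypothetical): in practice these would come from the
# image encoder and the text encoder, not be written by hand.
class_names = ["water", "forest", "building"]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_feat = [0.1, 0.9, 0.2]

pred, probs = classify(image_feat, text_feats)
print(class_names[pred])
```

In a few-shot setting, the small set of labeled image-text pairs would be used to adapt the encoders that produce these embeddings; the scoring step itself stays the same.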
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Diffusion · Contrastive Language-Image Pre-training
