DiffCLIP: Few-shot Language-driven Multimodal Classifier

Jiaqing Zhang; Mingxiang Cao; Xue Yang; Kai Jiang; Yunsong Li

arXiv:2412.07119 · cs.CV · December 11, 2024

TL;DR

DiffCLIP is a few-shot learning framework that extends CLIP to classify high-dimensional multimodal remote sensing images, combining unsupervised pretraining on unlabeled images with a small number of labeled image-text pairs.

Contribution

The paper introduces a framework that combines unsupervised mask diffusion learning with modality-shared encoders to improve CLIP's performance in specialized remote sensing domains.

Findings

1. Achieves a 10.65% improvement in accuracy over CLIP on remote sensing datasets.

2. Trains effectively with only 2-shot image-text pairs.

3. Demonstrates strong performance in few-shot multimodal remote sensing classification.

Abstract

Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized domains such as remote sensing due to the limited availability of image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a novel framework that extends CLIP to effectively convey comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP is a few-shot learning method that leverages unlabeled images for pretraining. It employs unsupervised mask diffusion learning to capture the distribution of diverse modalities without requiring labels. The modality-shared image encoder maps multimodal data into a unified subspace, extracting shared features with…
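
For readers unfamiliar with how a CLIP-style model uses language to classify an image, the following is a minimal sketch of the general idea: an image embedding is compared against one text embedding per class by cosine similarity, and the scores are turned into probabilities with a softmax. This is not DiffCLIP's implementation. The embeddings are made-up toy vectors, the function names are hypothetical, and the temperature of 0.07 is an assumed value chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(image_emb, text_embs, temperature=0.07):
    """Cosine similarity between one image embedding and each class text embedding.

    The temperature (an assumed value here) sharpens the resulting distribution.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    return logits


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_emb, text_embs, class_names):
    """Return the class whose text embedding is most similar to the image embedding."""
    probs = softmax(cosine_logits(image_emb, text_embs))
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


# Toy data (hypothetical): one text embedding per class prompt, one image embedding.
class_names = ["forest", "water", "building"]
text_embs = [
    [1.0, 0.1, 0.0],  # stands in for the embedding of a "forest" prompt
    [0.0, 1.0, 0.1],  # stands in for the embedding of a "water" prompt
    [0.1, 0.0, 1.0],  # stands in for the embedding of a "building" prompt
]
image_emb = [0.2, 0.9, 0.1]

label, probs = classify(image_emb, text_embs, class_names)
print(label, [round(p, 4) for p in probs])
```

In a real system the toy vectors would come from trained image and text encoders; the comparison step itself stays the same, which is why the quality of the shared embedding space matters so much when few labeled pairs are available.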

Code & Models

Repositories

icey-zhang/diffclip
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Diffusion · Contrastive Language-Image Pre-training