Cross-Modal Concept Learning and Inference for Vision-Language Models

Yi Zhang; Ce Zhang; Yushun Tang; Zhihai He

arXiv:2307.15460 · cs.CV · July 31, 2023 · 1 citation

Open Access

TL;DR

This paper introduces CCLI, an approach that uses semantic concepts to improve classification with vision-language models, raising few-shot learning accuracy by up to 8.0% and domain generalization performance by up to 1.3%.

Contribution

The paper proposes a new method called CCLI that learns visual concepts from images using CLIP and constructs discriminative representations for better classification.

Findings

01. Improves few-shot learning accuracy by up to 8.0%.

02. Enhances domain generalization performance by up to 1.3%.

03. Outperforms current state-of-the-art methods significantly.

Abstract

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole-image matching is not effective, since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic…
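To make the contrast in the abstract concrete, the sketch below shows whole-image matching (one image embedding scored against one text embedding per class) next to a concept-level score vector. It is a minimal illustration under assumptions, not the CCLI method: the abstract is truncated before the details, so the function names, the random stand-in embeddings, the feature dimension of 512, and the logit scale of 100 are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def whole_image_logits(image_feat, class_text_feats, scale=100.0):
    """Whole-image matching: score one image embedding against each class text embedding.

    `scale` is an assumed temperature-style multiplier, not a value from the paper.
    """
    image_feat = l2_normalize(image_feat)
    class_text_feats = l2_normalize(class_text_feats)
    return scale * (class_text_feats @ image_feat)

def concept_activations(image_feat, concept_text_feats):
    """Hypothetical concept-level view: cosine similarity to each text concept."""
    image_feat = l2_normalize(image_feat)
    concept_text_feats = l2_normalize(concept_text_feats)
    return concept_text_feats @ image_feat

# Random vectors stand in for CLIP image and text embeddings (assumption).
rng = np.random.default_rng(0)
dim, num_classes, num_concepts = 512, 5, 20
image_feat = rng.normal(size=dim)
class_text_feats = rng.normal(size=(num_classes, dim))
concept_text_feats = rng.normal(size=(num_concepts, dim))

logits = whole_image_logits(image_feat, class_text_feats)
predicted_class = int(np.argmax(logits))
activations = concept_activations(image_feat, concept_text_feats)
```

In the whole-image case the prediction depends on a single similarity per class, whereas the concept-level vector gives one score per semantic concept, which a downstream classifier could use as a representation. How CCLI actually selects concepts and builds its representation is not recoverable from the truncated abstract.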

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training