Cross-Modal Concept Learning and Inference for Vision-Language Models
Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He

TL;DR
This paper introduces CCLI, a novel approach that leverages semantic concepts to improve vision-language model classification, significantly enhancing performance in few-shot learning and domain generalization tasks.
Contribution
The paper proposes a new method called CCLI that learns visual concepts from images using CLIP and constructs discriminative representations for better classification.
Findings
Improves few-shot learning accuracy by up to 8.0%.
Enhances domain generalization performance by up to 1.3%.
Outperforms current state-of-the-art methods significantly.
Abstract
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole image matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic…
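The concept-learning idea described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: random unit vectors stand in for CLIP image and text embeddings, and the top-k averaging rule, the function names `learn_visual_concepts` and `concept_representation`, and the `top_k` parameter are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def learn_visual_concepts(image_feats, text_concept_feats, top_k=4):
    """Hypothetical rule: for each text concept, average the features of the
    top_k images most similar to it, giving one visual concept vector."""
    sims = text_concept_feats @ image_feats.T        # (concepts, images)
    top_idx = np.argsort(-sims, axis=1)[:, :top_k]   # (concepts, top_k)
    concepts = image_feats[top_idx].mean(axis=1)     # (concepts, dim)
    return l2_normalize(concepts)


def concept_representation(image_feats, visual_concepts):
    """Describe each image by its similarity to every visual concept,
    rather than by a single whole-image match against a class prompt."""
    return image_feats @ visual_concepts.T           # (images, concepts)


# Random stand-ins for CLIP embeddings (assumption: 32 images, 5 concepts, 16 dims).
rng = np.random.default_rng(0)
image_feats = l2_normalize(rng.normal(size=(32, 16)))
text_concept_feats = l2_normalize(rng.normal(size=(5, 16)))

visual_concepts = learn_visual_concepts(image_feats, text_concept_feats)
activations = concept_representation(image_feats, visual_concepts)
print(activations.shape)  # (32, 5)
```

In this sketch the per-image concept activations would then feed a classifier. How CCLI actually builds and uses its discriminative representation is not specified in the truncated abstract above.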
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
