IDEA: Image Description Enhanced CLIP-Adapter
Zhipeng Ye, Feng Jiang, Qiufeng Wang, Kaizhu Huang, Jiaqi Huang

TL;DR
The paper introduces IDEA, a training-free method that enhances CLIP's few-shot image classification by leveraging textual descriptions of images, with results comparable to or better than state-of-the-art models on multiple tasks.
Contribution
The paper proposes a training-free image-description enhancement for CLIP (IDEA) and extends it with trainable components (T-IDEA), significantly improving performance on multiple datasets.
Findings
IDEA achieves comparable or better results than state-of-the-art models.
T-IDEA further improves performance with lightweight learnable modules.
The work generated 1.6 million image-text pairs for dataset enhancement.
Abstract
CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it is comparable to, or even exceeds, state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two…
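Illustrative sketch
The abstract does not give IDEA's scoring formula, so the following is only a rough illustration of what a training-free few-shot classifier combining visual features with image descriptions might look like. Everything here is an assumption: the function name `training_free_logits`, the weights `alpha`, `beta`, and `theta`, and the exponential affinity (borrowed from cache-based adapters such as Tip-Adapter, not from this paper). Features are assumed to be precomputed CLIP embeddings; random arrays can stand in for them.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def training_free_logits(query_img, class_text, support_img, support_desc,
                         support_labels, alpha=1.0, beta=5.5, theta=0.5):
    """Hypothetical training-free scoring; NOT the paper's exact formula.

    query_img:      (Q, D) image features of test images
    class_text:     (C, D) text features of class prompts
    support_img:    (N, D) image features of the few-shot examples
    support_desc:   (N, D) text features of each example's description
    support_labels: (N,)   integer class labels of the few-shot examples
    """
    q = l2_normalize(query_img)
    t = l2_normalize(class_text)
    s_img = l2_normalize(support_img)
    s_desc = l2_normalize(support_desc)

    num_classes = t.shape[0]
    one_hot = np.eye(num_classes)[support_labels]          # (N, C)

    # Standard CLIP zero-shot term: scaled cosine similarity to class prompts.
    zero_shot = 100.0 * (q @ t.T)                          # (Q, C)

    # Assumed blend of image-image and image-description similarity.
    sim = theta * (q @ s_img.T) + (1.0 - theta) * (q @ s_desc.T)  # (Q, N)

    # Sharpen similarities and vote through the few-shot labels.
    affinity = np.exp(-beta * (1.0 - sim))                 # (Q, N)
    few_shot = affinity @ one_hot                          # (Q, C)

    return zero_shot + alpha * few_shot
```

No parameters are learned in this sketch: the few-shot examples act as a lookup table, which is one common way to realize a "training-free" adapter. The trainable components of T-IDEA are not modeled here.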
Taxonomy
Topics: AI in cancer detection · Medical Image Segmentation Techniques · Radiomics and Machine Learning in Medical Imaging
Methods: Focus · Contrastive Language-Image Pre-training · LLaMA · Adapter
