CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections
Mohamed Fazli Imam, Rufael Fedaku Marew, Jameel Hassan, Mustansar Fiaz, Alham Fikri Aji, Hisham Cholakkal

TL;DR
This paper introduces NoLA, a label-free prompt-tuning method that combines DINO's visual features and LLM-based textual embeddings to significantly improve zero-shot image classification performance with unlabeled data.
Contribution
It proposes a novel three-step framework that leverages LLMs and DINO to enhance CLIP's zero-shot classification without labeled data, surpassing existing methods.
Findings
Achieves 3.6% average gain over state-of-the-art LaFTer
Effectively utilizes unlabeled images for improved classification
Outperforms previous label-free classification approaches
Abstract
In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, models pretrained with self-supervised learning (SSL), such as DINO, excel at extracting rich visual features due to their specialized training paradigm. Yet these SSL models require an additional supervised linear probing step, which relies on fully labeled data that is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of SSL models (DINO) and the broad textual knowledge of large language models (LLMs) to substantially enhance CLIP-based image classification performance using unlabeled images. Our approach unfolds in three key steps: (1) We…
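As background for the abstract above, the zero-shot classification that CLIP-style models perform in a shared embedding space can be sketched roughly as follows. This is a minimal illustration, not the paper's NoLA method: the toy 3-dimensional vectors stand in for real encoder outputs, the function names (`class_prototype`, `zero_shot_predict`) are hypothetical, and averaging several description embeddings per class is an assumed way of using LLM-generated class descriptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_prototype(description_embeddings):
    """Average the normalized text embeddings of several descriptions of one class."""
    normed = [l2_normalize(e) for e in description_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def zero_shot_predict(image_embedding, prototypes, temperature=0.01):
    """Return (predicted class index, class probabilities).

    Logits are cosine similarities divided by a temperature (assumed value),
    followed by a numerically stable softmax.
    """
    img = l2_normalize(image_embedding)
    logits = [sum(a * b for a, b in zip(img, p)) / temperature for p in prototypes]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs

# Toy embeddings (hypothetical): two text descriptions per class, one image.
cat_descriptions = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
dog_descriptions = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]
prototypes = [class_prototype(cat_descriptions), class_prototype(dog_descriptions)]
pred, probs = zero_shot_predict([0.7, 0.3, 0.05], prototypes)
```

No labeled images are needed at this stage: the class "weights" come entirely from text embeddings, which is why the quality of both the text descriptions and the visual features matters for the final accuracy.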
Taxonomy
Topics: COVID-19 diagnosis using AI · Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Layer Normalization · Multi-Head Attention · Residual Connection · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training
