CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections

Mohamed Fazli Imam; Rufael Fedaku Marew; Jameel Hassan; Mustansar Fiaz; Alham Fikri Aji; Hisham Cholakkal

arXiv:2411.19346·cs.CV·April 11, 2025

TL;DR

This paper introduces NoLA, a label-free prompt-tuning method that combines DINO's visual features and LLM-based textual embeddings to significantly improve zero-shot image classification performance with unlabeled data.

Contribution

The paper proposes a three-step framework that uses LLMs and DINO to improve CLIP's zero-shot classification without labeled data, surpassing existing methods.

Findings

01. Achieves a 3.6% average gain over the state-of-the-art LaFTer
02. Effectively utilizes unlabeled images for improved classification
03. Outperforms previous label-free classification approaches

Abstract

In the era of foundation models, CLIP has emerged as a powerful tool for aligning text and visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at extracting rich visual features due to their specialized training paradigm. Yet these SSL models require an additional supervised linear probing step that relies on fully labeled data, which is often expensive and difficult to obtain at scale. In this paper, we propose a label-free prompt-tuning method that leverages the rich visual features of self-supervised learning models (DINO) and the broad textual knowledge of large language models (LLMs) to substantially enhance CLIP-based image classification performance using unlabeled images. Our approach unfolds in three key steps: (1) We…
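
For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of zero-shot classification in a shared text-image embedding space. It is not the paper's method and does not reproduce its three steps. The function names, the three-dimensional toy embeddings, and the temperature of 0.01 are illustrative assumptions; in practice the embeddings would come from a model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_embedding, class_text_embeddings, temperature=0.01):
    """Pick the class whose text embedding is most similar to the image embedding.

    `class_text_embeddings` maps a class name to its text embedding.
    The temperature value is an assumption chosen for illustration.
    """
    img = l2_normalize(image_embedding)
    names = list(class_text_embeddings)
    sims = []
    for name in names:
        txt = l2_normalize(class_text_embeddings[name])
        sims.append(sum(a * b for a, b in zip(img, txt)))

    # Temperature-scaled softmax; subtracting the max keeps exp() stable.
    logits = [s / temperature for s in sims]
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(names, exps)}
    return max(probs, key=probs.get), probs


# Hypothetical toy embeddings standing in for encoder outputs.
text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

No labels are used at any point: the class names alone define the classifier, which is why the quality of both the visual features and the class text embeddings matters for this setting.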

Code & Models

Repositories

fazliimam/NoLA (PyTorch, official)

Taxonomy

Topics: COVID-19 diagnosis using AI · Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Layer Normalization · Multi-Head Attention · Residual Connection · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training