Transductive Zero-Shot and Few-Shot CLIP

Ségolène Martin (OPIS; CVN); Yunshi Huang (ETS); Fereshteh Shakeri (ETS); Jean-Christophe Pesquet (OPIS; CVN); Ismail Ben Ayed (ETS)

arXiv:2405.18437·cs.CV·May 30, 2024

TL;DR

This paper introduces a transductive inference method for CLIP that jointly classifies batches of unlabeled data, significantly improving zero-shot and few-shot image classification accuracy through a novel EM-inspired optimization approach.

Contribution

It proposes a new transductive inference framework for CLIP using a Dirichlet-based EM-inspired algorithm, enhancing classification performance on multiple datasets.
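The Dirichlet-mixture idea can be illustrated with a plain EM loop over a batch of probability features. This is a minimal sketch, not the authors' block Majorization-Minimization algorithm. It swaps in a weighted method-of-moments M-step for the Dirichlet parameters, assumes one Dirichlet component per class, and initialises the soft assignments with the zero-shot probability features. The helper names (`dirichlet_logpdf`, `moment_alpha`, `dirichlet_em`) are hypothetical.

```python
import math
import numpy as np

def dirichlet_logpdf(Z, alpha):
    """Log-density of each row of Z (n x K, rows on the unit simplex) under Dirichlet(alpha)."""
    log_norm = math.lgamma(alpha.sum()) - sum(math.lgamma(a) for a in alpha)
    return log_norm + np.log(np.clip(Z, 1e-12, None)) @ (alpha - 1.0)

def moment_alpha(Z, w):
    """Weighted method-of-moments Dirichlet estimate (hypothetical helper, not the paper's update)."""
    w = w / w.sum()
    mean = w @ Z                          # weighted mean of each coordinate, shape (K,)
    var = w @ (Z - mean) ** 2             # weighted variance of each coordinate
    # For Dirichlet(alpha) with s = sum(alpha): Var[z_k] = m_k (1 - m_k) / (s + 1).
    s = np.median(mean * (1.0 - mean) / np.maximum(var, 1e-12)) - 1.0
    return mean * max(s, 1e-3)

def dirichlet_em(Z, n_iter=50):
    """EM for a K-component Dirichlet mixture over a batch Z (n x K).

    Returns hard labels, per-class Dirichlet parameters and batch class proportions.
    """
    K = Z.shape[1]
    R = Z.copy()                          # soft assignments, initialised from the zero-shot probabilities
    for _ in range(n_iter):
        pi = R.mean(axis=0)               # M-step: class proportions in the batch
        alphas = [moment_alpha(Z, R[:, k] + 1e-12) for k in range(K)]  # M-step: Dirichlet parameters
        log_post = np.stack([dirichlet_logpdf(Z, a) for a in alphas], axis=1) + np.log(pi + 1e-12)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        R = np.exp(log_post)              # E-step: posterior class probabilities for every sample
        R /= R.sum(axis=1, keepdims=True)
    return R.argmax(axis=1), alphas, pi
```

Because the soft assignments start from the zero-shot probabilities, component k stays tied to class prompt k; a random initialisation would instead require matching mixture components to classes afterwards.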

Findings

01. Roughly 20% accuracy improvement over CLIP's zero-shot baseline on ImageNet
02. Outperforms state-of-the-art methods in few-shot classification
03. Effective batch (transductive) inference for vision-language models

Abstract

Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast-growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We initially construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class…
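As an assumption, the vision-text probability features can be approximated by CLIP's standard zero-shot softmax: cosine similarities between L2-normalised image and class-prompt embeddings, scaled by a temperature and normalised into a point on the unit simplex. The function name `clip_probability_features` and the `temperature=100.0` default are illustrative assumptions, random vectors stand in for real CLIP encoder outputs, and the paper's exact feature construction may differ.

```python
import numpy as np

def clip_probability_features(img_emb, txt_emb, temperature=100.0):
    """Map image embeddings to points on the unit simplex via a CLIP-style softmax.

    img_emb: (n, d) image embeddings; txt_emb: (K, d), one text embedding per class prompt.
    temperature=100.0 mirrors CLIP's logit scale (assumption for this sketch).
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = temperature * img @ txt.T             # (n, K) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Usage with stand-in embeddings (hypothetical sizes, not real CLIP outputs).
rng = np.random.default_rng(0)
img = rng.normal(size=(64, 512))   # a batch of 64 "image embeddings"
txt = rng.normal(size=(10, 512))   # 10 "class-prompt embeddings"
Z = clip_probability_features(img, txt)
zero_shot_pred = Z.argmax(axis=1)  # inductive baseline: each sample classified on its own
```

Taking the argmax of each row of `Z` independently is the inductive zero-shot baseline; the transductive setting instead treats all rows of the batch jointly as samples on the simplex.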
