Transductive Zero-Shot and Few-Shot CLIP
Ségolène Martin (OPIS, CVN), Yunshi Huang (ETS), Fereshteh Shakeri (ETS), Jean-Christophe Pesquet (OPIS, CVN), Ismail Ben Ayed (ETS)

TL;DR
This paper introduces a transductive inference method for CLIP that jointly classifies batches of unlabeled data, significantly improving zero-shot and few-shot image classification accuracy through a novel EM-inspired optimization approach.
Contribution
It proposes a new transductive inference framework for CLIP using a Dirichlet-based EM-inspired algorithm, enhancing classification performance on multiple datasets.
Findings
Nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance
Outperforms state-of-the-art in few-shot classification
Effective batch inference method for vision-language models
Abstract
Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast-growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We initially construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class…
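The two ingredients named in the abstract, probability features on the unit simplex and a per-class Dirichlet model, can be illustrated with a minimal sketch. It assumes the features are a softmax over scaled cosine similarities between an image embedding and the class text embeddings; the function names, the temperature value, and the toy embeddings are all hypothetical and not taken from the paper. It only evaluates a Dirichlet log-density at one feature vector and does not implement the block Majorization-Minimization algorithm.

```python
import math

def vision_text_probabilities(image_emb, text_embs, temperature=100.0):
    """Softmax over scaled cosine similarities (assumed feature construction).

    Returns one probability per class; the vector lies on the unit simplex.
    The temperature value is an assumption, not a value from the paper.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    img = normalize(image_emb)
    logits = [temperature * sum(a * b for a, b in zip(img, normalize(t)))
              for t in text_embs]
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dirichlet_log_density(z, alpha, eps=1e-12):
    """Log-density of Dirichlet(alpha) at a point z of the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # eps guards against log(0) when a probability underflows.
    return log_norm + sum((a - 1.0) * math.log(p + eps)
                          for a, p in zip(alpha, z))

# Toy data: three class text embeddings and one image embedding (hypothetical).
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]

z = vision_text_probabilities(image_emb, text_embs, temperature=10.0)
log_p = dirichlet_log_density(z, alpha=[1.0, 1.0, 1.0])
```

In a transductive setting, one such feature vector would be computed for every query sample in the batch, and the Dirichlet parameters of each class would be estimated jointly with the class assignments rather than fixed as in this toy call.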
