Boosting Vision-Language Models with Transduction

Maxime Zanella; Benoît Gérin; Ismail Ben Ayed

arXiv:2406.01837·cs.CV·June 5, 2024·1 cites

TL;DR

This paper introduces TransCLIP, a transductive approach that enhances vision-language models by leveraging unlabeled data through a novel objective and optimization method, significantly improving zero- and few-shot learning performance.

Contribution

The paper presents TransCLIP, a plug-and-play transductive framework that pairs a new KL-regularized objective with an efficient Block Majorize-Minimize (BMM) optimization procedure to improve the zero- and few-shot predictions of vision-language models.

Findings

1. TransCLIP improves generalization of zero- and few-shot VLMs.
2. It outperforms standard transductive methods relying only on vision features.
3. KL-based language constraints are key to performance gains.

Abstract

Transduction is a powerful paradigm that leverages the structure of unlabeled data to boost predictive accuracy. We present TransCLIP, a novel and computationally efficient transductive approach designed for Vision-Language Models (VLMs). TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models, consistently improving their performance. Our new objective function can be viewed as a regularized maximum-likelihood estimation, constrained by a KL divergence penalty that integrates the text-encoder knowledge and guides the transductive learning process. We further derive an iterative Block Majorize-Minimize (BMM) procedure for optimizing our objective, with guaranteed convergence and decoupled sample-assignment updates, yielding computationally efficient transduction for large-scale datasets. We report comprehensive evaluations, comparisons,…
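To make the idea of a KL penalty with decoupled sample-assignment updates more concrete, here is a minimal sketch in Python. It is not the TransCLIP objective or its BMM procedure, whose exact form is not given on this page. It only illustrates a generic fact: minimizing, for each sample, a negative log-likelihood term plus `lam * KL(z || text_prior)` over the probability simplex has a closed-form solution, so every sample can be updated independently. The function names, the parameters `lam` and `temp`, and the centroid-based likelihood are assumptions made for this illustration.

```python
import numpy as np

def kl_regularized_assignments(log_lik, text_prior, lam=1.0):
    """Per-sample closed-form minimizer over the probability simplex of
        -sum_k z_k * log_lik_k + lam * KL(z || text_prior),
    which is z_k proportional to text_prior_k * exp(log_lik_k / lam).
    log_lik and text_prior both have shape (n_samples, n_classes).
    """
    logits = np.log(text_prior + 1e-12) + log_lik / lam
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    z = np.exp(logits)
    return z / z.sum(axis=1, keepdims=True)

def toy_transduction(features, text_prior, n_iters=10, lam=1.0, temp=0.1):
    """Hypothetical alternating scheme (a simplified stand-in, not the paper's BMM).

    text_prior plays the role of zero-shot predictions from the text encoder.
    """
    z = text_prior.copy()
    for _ in range(n_iters):
        # Block 1: class centroids as assignment-weighted means of unlabeled features.
        centroids = (z.T @ features) / (z.sum(axis=0)[:, None] + 1e-12)
        # Block 2: assignment update; each row depends only on its own sample.
        sq_dist = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        z = kl_regularized_assignments(-sq_dist / temp, text_prior, lam)
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(6, 4))             # stand-in for image embeddings
    prior = rng.dirichlet(np.ones(3), size=6)   # stand-in for text-based predictions
    print(toy_transduction(feats, prior).round(3))
```

In this sketch, a larger `lam` keeps the assignments closer to the text-based prior, while a smaller `lam` lets the structure of the unlabeled features dominate.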

Code & Models

Repositories

MaxZanella/transduction-for-vlms
PyTorch · Official

Videos

Boosting Vision-Language Models with Transduction · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications