CLIP's Visual Embedding Projector is a Few-shot Cornucopia
Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

TL;DR
ProLIP is a simple, architecture-agnostic method that adapts contrastively pretrained vision-language models such as CLIP to few-shot classification by fine-tuning the vision encoder's projection matrix. It achieves state-of-the-art results and also excels in transfer and adaptation tasks.
Contribution
The paper introduces ProLIP, a regularization-based fine-tuning approach for contrastively pretrained models, and proposes the Regularized Linear Adapter (RLA), a linear adapter that requires no hyperparameter tuning.
Findings
State-of-the-art performance on 11 few-shot benchmarks
Superior in transfer, domain generalization, and test-time adaptation
Faster training compared to prompt tuning
Abstract
We introduce ProLIP, a simple and architecture-agnostic method for adapting contrastively pretrained vision-language models, such as CLIP, to few-shot classification. ProLIP fine-tunes the vision encoder's projection matrix with Frobenius norm regularization on its deviation from the pretrained weights. It achieves state-of-the-art performance on 11 few-shot classification benchmarks under both "few-shot validation" and "validation-free" settings. Moreover, by rethinking the non-linear CLIP-Adapter through ProLIP's lens, we design a Regularized Linear Adapter (RLA) that performs better, requires no hyperparameter tuning, is less sensitive to learning rate values, and offers an alternative to ProLIP in black-box scenarios where model weights are inaccessible. Beyond few-shot classification, ProLIP excels in cross-dataset transfer, domain generalization, base-to-new class…
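The objective described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `prolip_style_loss`), the squared form of the Frobenius penalty, the mean-reduced cross-entropy, and the default values of `lam` and `temperature` are all hypothetical choices made for the example. Here `feats` stands for the vision encoder's features before projection, `W` for the trainable projection matrix, `W0` for its frozen pretrained value, and `text_emb` for unit-norm class text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to (approximately) unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def prolip_style_loss(W, W0, feats, text_emb, labels, lam=1.0, temperature=0.01):
    """Cross-entropy on cosine-similarity logits plus lam * ||W - W0||_F^2.

    Assumed shapes (illustrative only):
      W, W0    : (d_in, d_out)  trainable / pretrained projection matrices
      feats    : (N, d_in)      pre-projection visual features
      text_emb : (C, d_out)     unit-norm class text embeddings
      labels   : (N,)           integer class indices
    """
    # Project the visual features and normalize, as in CLIP-style scoring.
    img_emb = l2_normalize(feats @ W)
    logits = img_emb @ text_emb.T / temperature

    # Numerically stable log-softmax over classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # Penalize deviation of the projection from its pretrained value.
    reg = lam * np.sum((W - W0) ** 2)
    return ce + reg, ce, reg
```

At `W == W0` the penalty term is zero, so the loss reduces to the zero-shot cross-entropy; larger values of `lam` keep the fine-tuned projection closer to the pretrained one.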
Taxonomy
Topics: Gastroesophageal reflux and treatments · Esophageal Cancer Research and Treatment
Methods: Contrastive Language-Image Pre-training
