CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Peng Gao; Shijie Geng; Renrui Zhang; Teli Ma; Rongyao Fang; Yongfeng Zhang; Hongsheng Li; Yu Qiao

arXiv:2110.04544 · cs.CV · March 26, 2025 · 111 cites

TL;DR

CLIP-Adapter introduces feature adapters for fine-tuning vision-language models, outperforming prompt tuning methods by enhancing feature representations with a simple residual bottleneck layer.

Contribution

The paper proposes CLIP-Adapter, a novel fine-tuning approach using feature adapters that improve vision-language model performance without complex prompt engineering.

Findings

01. Outperforms context optimization in various tasks

02. Maintains simplicity while enhancing performance

03. Effective across multiple visual classification benchmarks

Abstract

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained on a fixed set of discrete labels, a new paradigm was introduced by Radford et al. (2021) to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al., 2021) has been proposed to learn continuous vectors as task-specific prompts with few-shot training examples. In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning. While prompt tuning is for the textual inputs, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.…
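
To make the idea of a feature adapter concrete, here is a minimal, framework-free sketch of a residual bottleneck layer of the kind the TL;DR mentions, followed by prompt-style classification via cosine similarity. It is an illustration under stated assumptions, not the authors' implementation: the names `FeatureAdapter` and `classify`, the reduction factor, the blending ratio `alpha`, the temperature, and the random initialization are all hypothetical choices, and no training loop is shown.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(w, v):
    # w is a list of rows; returns the matrix-vector product w @ v.
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


class FeatureAdapter:
    """Bottleneck (dim -> dim // reduction -> dim) blended with the input.

    Assumption: the adapted feature is mixed with the original feature as
    alpha * adapted + (1 - alpha) * original. alpha and reduction are
    illustrative hyperparameters, not values taken from the text above.
    """

    def __init__(self, dim, reduction=4, alpha=0.2, seed=0):
        rng = random.Random(seed)
        hidden = max(1, dim // reduction)
        scale = 1.0 / math.sqrt(dim)
        # Randomly initialized weights; in practice these would be learned
        # from few-shot examples while the pre-trained encoder stays frozen.
        self.w_down = [[rng.gauss(0.0, scale) for _ in range(dim)]
                       for _ in range(hidden)]
        self.w_up = [[rng.gauss(0.0, scale) for _ in range(hidden)]
                     for _ in range(dim)]
        self.alpha = alpha

    def __call__(self, feature):
        hidden = relu(matvec(self.w_down, feature))
        adapted = matvec(self.w_up, hidden)
        # Residual-style blending keeps most of the pre-trained feature.
        return [self.alpha * a + (1.0 - self.alpha) * f
                for a, f in zip(adapted, feature)]


def classify(image_feature, class_text_features, temperature=0.01):
    """Softmax over cosine similarities between one image and each class."""
    img = l2_normalize(image_feature)
    logits = []
    for text_feature in class_text_features:
        txt = l2_normalize(text_feature)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch, setting `alpha=0.0` returns the pre-trained feature unchanged, so the adapter reduces to plain zero-shot classification; a larger `alpha` gives the learned features more weight. The same adapter could be applied to text features instead of image features, matching the abstract's "either the visual or the language branch".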

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques