Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang

TL;DR
This paper introduces a reinforcement learning-based test-time adaptation method using CLIP feedback to improve zero-shot generalization of vision-language models under distribution shifts.
Contribution
It proposes RLCF, a flexible framework that uses CLIP as a reward model during test-time adaptation, extending beyond classification to retrieval and captioning tasks.
Findings
RLCF improves zero-shot performance across multiple vision-language tasks, including classification, retrieval, and captioning.
The method effectively prevents models from becoming overconfident during adaptation.
Experimental results show significant gains over baseline methods.
Abstract
One fascinating aspect of pre-trained vision-language models (VLMs) learning under language supervision is their impressive zero-shot generalization capability. However, this ability is hindered by distribution shifts between the training and testing data. Previous test-time adaptation (TTA) methods for VLMs in zero-shot classification rely on minimizing the entropy of model outputs, tending to be stuck in incorrect model predictions. In this work, we propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident. Specifically, a CLIP model is adopted as the reward model during TTA and provides feedback for the VLM. Given a single test sample, the VLM is forced to maximize the CLIP reward between the input and sampled results from the VLM output distribution. The proposed reinforcement learning with CLIP feedback (RLCF) framework is…
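The idea in the abstract (sample candidates from the model's output distribution, score them with a reward model, and push probability toward higher-reward candidates) can be illustrated with a toy policy-gradient step. This is a minimal sketch under stated assumptions, not the paper's implementation: the `rewards` vector is a hypothetical stand-in for CLIP similarity scores, the logits are updated directly rather than through a VLM's parameters, and the mean sampled reward is assumed as the baseline. All function names are illustrative.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def expected_reward(logits, rewards):
    """Expected reward under the categorical distribution given by the logits."""
    return float(softmax(logits) @ rewards)


def reinforce_step(logits, rewards, num_samples, lr, rng):
    """One REINFORCE ascent step on the logits for a single test sample.

    rewards[k] is a stand-in for the reward model's score of candidate k
    (an assumption of this sketch; a real system would query CLIP here).
    """
    probs = softmax(logits)
    # Sample candidate outputs from the model's own output distribution.
    samples = rng.choice(len(logits), size=num_samples, p=probs)
    sample_rewards = rewards[samples]
    # Assumed baseline: the mean reward of the sampled candidates.
    advantages = sample_rewards - sample_rewards.mean()
    # Gradient of log p_k with respect to the logits is one_hot(k) - probs.
    one_hot = np.eye(len(logits))[samples]
    grad = (advantages[:, None] * (one_hot - probs)).mean(axis=0)
    return logits + lr * grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.0, 0.0])   # model initially prefers candidate 0
    rewards = np.array([0.2, 0.9, 0.1])  # reward model prefers candidate 1
    before = expected_reward(logits, rewards)
    for _ in range(200):
        logits = reinforce_step(logits, rewards, num_samples=32, lr=0.5, rng=rng)
    print(before, expected_reward(logits, rewards))
```

Unlike entropy minimization, which sharpens whatever the model already prefers, the update here depends on an external score: if every sampled candidate receives the same reward, the advantages are zero and the logits do not move.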
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Test · ALIGN · Contrastive Language-Image Pre-training
