Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

Shuai Zhao; Xiaohan Wang; Linchao Zhu; Yi Yang

arXiv:2305.18010 · cs.CV · February 22, 2024 · 5 cites

TL;DR

This paper introduces a reinforcement learning-based test-time adaptation method using CLIP feedback to improve zero-shot generalization of vision-language models under distribution shifts.

Contribution

It proposes RLCF, a flexible framework that uses CLIP as a reward model during test-time adaptation, extending beyond classification to retrieval and captioning tasks.

Findings

01 RLCF improves zero-shot performance across multiple VL tasks.

02 The method effectively prevents models from becoming overconfident during adaptation.

03 Experimental results show significant gains over baseline methods.

Abstract

One fascinating aspect of pre-trained vision-language models (VLMs) learning under language supervision is their impressive zero-shot generalization capability. However, this ability is hindered by distribution shifts between the training and testing data. Previous test-time adaptation (TTA) methods for VLMs in zero-shot classification rely on minimizing the entropy of model outputs, which tends to leave them stuck in incorrect model predictions. In this work, we propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident. Specifically, a CLIP model is adopted as the reward model during TTA and provides feedback for the VLM. Given a single test sample, the VLM is forced to maximize the CLIP reward between the input and sampled results from the VLM output distribution. The proposed reinforcement learning with CLIP feedback (RLCF) framework is…
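
The reward-maximization step described in the abstract can be illustrated with a toy policy-gradient (REINFORCE) update for a single test sample. This is only a sketch under stated assumptions, not the authors' implementation: the `toy_rewards` list is a hypothetical stand-in for CLIP scores between the input and each candidate output, the mean-reward baseline and the direct update of the logits are illustrative choices, and the names `softmax` and `reinforce_step` are invented here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, rewards, num_samples=64, lr=0.5, rng=None):
    """One REINFORCE update of the logits for a single test sample.

    rewards[k] is a hypothetical stand-in for a CLIP score between the
    input and candidate output k. The sample-mean baseline is an
    assumption made for this sketch.
    """
    if rng is None:
        rng = random.Random(0)
    probs = softmax(logits)
    # Sample candidate outputs from the model's own output distribution.
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    baseline = sum(rewards[k] for k in samples) / num_samples
    grad = [0.0] * len(logits)
    for k in samples:
        advantage = rewards[k] - baseline
        # d log p(k) / d logit_j = 1[j == k] - p_j
        for j in range(len(logits)):
            indicator = 1.0 if j == k else 0.0
            grad[j] += advantage * (indicator - probs[j])
    # Gradient ascent on the expected reward.
    return [l + lr * g / num_samples for l, g in zip(logits, grad)]


# Toy usage: the model initially prefers class 0, but the stand-in
# reward favors class 1.
rng = random.Random(0)
initial_logits = [2.0, 1.0, 0.0]
toy_rewards = [0.1, 0.9, 0.2]
adapted_logits = list(initial_logits)
for _ in range(50):
    adapted_logits = reinforce_step(adapted_logits, toy_rewards, rng=rng)
```

Unlike entropy minimization, which only sharpens whatever the model already predicts, the update direction here depends on the external reward, so probability mass can move away from an initially confident but poorly rewarded prediction.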

Code & Models

Repositories

mzhaoshuai/rlcf (PyTorch, official)

Videos

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Test · ALIGN · Contrastive Language-Image Pre-training