WATT: Weight Average Test-Time Adaptation of CLIP

David Osowiechi; Mehrdad Noori; Gustavo Adolfo Vargas Hakim; Moslem Yazdanpanah; Ali Bahri; Milad Cheraghalikhani; Sahar Dastani; Farzad Beizaee; Ismail Ben Ayed; Christian Desrosiers

arXiv:2406.13875 · cs.CV · June 26, 2024

TL;DR

WATT enhances CLIP's zero-shot image classification through test-time adaptation with weight averaging and a text ensemble strategy, improving performance across several domain-shifted datasets without additional trainable modules.

Contribution

This paper introduces WATT, a test-time adaptation method for CLIP that uses the model's own predictions as pseudo labels, averages the updated weights, and ensembles text prompts to improve robustness without extra trainable modules.

Findings

01

WATT improves CLIP's performance on multiple domain-shifted datasets.

02

The method operates effectively with just a single image per test case.

03

WATT does not require additional trainable modules or model transformations.

Abstract

Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted with domain shifts. In response, we present Weight Average Test-Time Adaptation (WATT) of CLIP, a pioneering approach facilitating full test-time adaptation (TTA) of this VLM. Our method employs a diverse set of templates for text prompts, augmenting the existing framework of CLIP. Predictions are utilized as pseudo labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy, enhancing overall test performance by aggregating diverse textual cues. Our findings underscore the efficacy of WATT in enhancing performance across diverse datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, VisDA-C,…
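
The abstract names three ingredients (pseudo labels from the model's own predictions, weight averaging, and a text ensemble) without giving formulas in this excerpt. The following is a minimal sketch of how such pieces could look, not the authors' implementation. It assumes uniform averaging of parameters, a text ensemble formed as the renormalized mean of per-template class embeddings, and cosine-similarity pseudo labels. The function names (`average_weights`, `ensemble_text_embeddings`, `pseudo_labels`) are hypothetical, and NumPy arrays stand in for CLIP parameters and embeddings.

```python
import numpy as np


def average_weights(state_dicts):
    """Uniformly average parameter arrays that share the same keys.

    Assumption: each entry is the set of weights obtained after adapting
    the model with one text template.
    """
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}


def ensemble_text_embeddings(per_template_embeddings):
    """Average class embeddings over templates, then renormalize.

    per_template_embeddings has shape (templates, classes, dim).
    Assumption: a plain mean is used to aggregate the templates.
    """
    emb = np.asarray(per_template_embeddings, dtype=np.float64)
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
    mean = emb.mean(axis=0)
    return mean / np.linalg.norm(mean, axis=-1, keepdims=True)


def pseudo_labels(image_embeddings, text_embeddings):
    """Label each image with the class whose text embedding is most similar."""
    return np.argmax(image_embeddings @ text_embeddings.T, axis=-1)


# Toy data: two "adapted models", two templates, two classes, 2-D embeddings.
models = [{"w": np.array([1.0, 3.0])}, {"w": np.array([3.0, 5.0])}]
averaged = average_weights(models)

templates = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # template 1: class 0, class 1
    [[2.0, 0.0], [0.0, 3.0]],   # template 2: same directions, other norms
])
text = ensemble_text_embeddings(templates)

images = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = pseudo_labels(images, text)
```

In a real setting the dictionaries would be model state dicts and the embeddings would come from CLIP's text and image encoders; the toy arrays here only show the shapes involved.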

Code & Models

Repositories

mehrdad-noori/watt
PyTorch · Official

Taxonomy

Topics: Video Coding and Compression Technologies · Embedded Systems Design Techniques · VLSI and Analog Circuit Testing

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training