Prototype-Based Test-Time Adaptation of Vision-Language Models
Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji

TL;DR
This paper introduces Prototype-Based Test-Time Adaptation (PTA), a test-time adaptation method for vision-language models that replaces the sample cache with class-specific prototypes, improving accuracy while avoiding the inference overhead of cache-based designs.
Contribution
PTA is a TTA approach that keeps one knowledge prototype per class and updates it adaptively from test samples, avoiding the latency growth and sample-quality issues of cache-based designs while achieving state-of-the-art results.
Findings
PTA improves CLIP accuracy from 65.64% to 69.38% on 10 benchmarks, a gain of 3.74 percentage points.
PTA retains 92% of CLIP's inference speed on large-scale datasets.
PTA outperforms cache-based TTA methods in both accuracy and efficiency.
Abstract
Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, performance is suboptimal when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. In particular, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual…
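The abstract is cut off before the exact update rule, so the following is only a rough sketch of the general idea it describes: one prototype per class, updated from each test sample with a weight given by that sample's zero-shot class confidence. The confidence-weighted running mean, the class name PrototypeAdapter, and the parameters logit_scale and alpha are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - x.max())
    return e / e.sum()


def l2_normalize(x, eps=1e-12):
    # Normalize along the last axis; all-zero rows stay zero.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


class PrototypeAdapter:
    """Hypothetical sketch: one running prototype per class, no sample cache."""

    def __init__(self, text_features, logit_scale=100.0, alpha=1.0):
        # text_features: (C, D) class text embeddings from the VLM.
        self.text = l2_normalize(np.asarray(text_features, dtype=np.float64))
        self.prototypes = np.zeros_like(self.text)       # (C, D)
        self.weight_sum = np.zeros(self.text.shape[0])   # (C,)
        self.logit_scale = logit_scale  # assumed temperature, not from the paper
        self.alpha = alpha              # assumed mixing weight, not from the paper

    def predict_and_update(self, image_feature):
        f = l2_normalize(np.asarray(image_feature, dtype=np.float64))

        # Zero-shot logits and per-class confidence for this sample.
        zero_shot_logits = self.logit_scale * (self.text @ f)
        conf = softmax(zero_shot_logits)

        # Combine zero-shot logits with similarity to the prototypes
        # accumulated from earlier test samples.
        proto_logits = self.logit_scale * (l2_normalize(self.prototypes) @ f)
        logits = zero_shot_logits + self.alpha * proto_logits

        # Assumed update: confidence-weighted running mean per class.
        self.weight_sum += conf
        step = conf / np.maximum(self.weight_sum, 1e-12)
        self.prototypes += step[:, None] * (f[None, :] - self.prototypes)

        return int(np.argmax(logits)), conf


# Toy usage with 3 classes in a 4-dimensional embedding space.
adapter = PrototypeAdapter(np.eye(3, 4))
pred_a, conf_a = adapter.predict_and_update(np.array([0.9, 0.1, 0.0, 0.0]))
pred_b, conf_b = adapter.predict_and_update(np.array([0.1, 0.9, 0.0, 0.0]))
```

In this sketch the stored state is a single (C, D) array plus C scalar weights, so memory and per-sample cost do not grow with the number of test samples seen. A cache-based design would instead store multiple samples per class, which is the source of the latency growth the abstract points to.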
