Frustratingly Easy Test-Time Adaptation of Vision-Language Models
Matteo Farina, Gianni Franchi, Giovanni Iacca, Massimiliano Mancini, Elisa Ricci

TL;DR
This paper introduces ZERO, a simple, fast, and memory-efficient test-time adaptation method for vision-language models that significantly improves performance without backpropagation.
Contribution
The authors reveal a hidden, effective TTA method within prompt tuning and propose ZERO, which requires only a single forward pass and no backpropagation, outperforming existing methods.
Findings
ZERO surpasses state-of-the-art TTA methods in accuracy.
ZERO is nearly 10x faster and 13x more memory-efficient than standard Test-Time Prompt Tuning.
ZERO is a strong, simple baseline for future TTA research.
Abstract
Vision-Language Models seamlessly discriminate among arbitrary semantic categories, yet they still suffer from poor generalization when presented with challenging examples. For this reason, Episodic Test-Time Adaptation (TTA) strategies have recently emerged as powerful techniques to adapt VLMs in the presence of a single unlabeled image. The recent literature on TTA is dominated by the paradigm of prompt tuning by Marginal Entropy Minimization, which, relying on online backpropagation, inevitably slows down inference while increasing memory. In this work, we theoretically investigate the properties of this approach and unveil that a surprisingly strong TTA method lies dormant and hidden within it. We term this approach ZERO (TTA with "zero" temperature), whose design is both incredibly effective and frustratingly simple: augment N times, predict, retain the most confident predictions,…
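The steps listed in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The names `zero_style_predict` and `keep_fraction` are hypothetical, and the sketch takes per-view logits as input instead of running augmentations through a real vision-language model. The final step is an assumption, since the abstract above is cut off after "retain the most confident predictions": going only by the "zero temperature" name, each retained view is reduced to a one-hot vote (the limit of a softmax as the temperature goes to zero) and the most voted class is returned.

```python
import math
from collections import Counter


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of class logits."""
    top = max(logits)
    exps = [math.exp((x - top) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_style_predict(view_logits, keep_fraction=0.3):
    """Pick a class from the logits of N augmented views of one image.

    view_logits: list of N lists, one list of class logits per view.
    keep_fraction: hypothetical knob, the share of views to retain.
    """
    # Predict: one probability vector per augmented view.
    probs = [softmax(logits) for logits in view_logits]

    # Retain the most confident views (confidence = highest class probability).
    ranked = sorted(probs, key=max, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = ranked[:n_keep]

    # Assumed final step: a zero-temperature softmax is a one-hot argmax,
    # so averaging the retained views amounts to a majority vote.
    votes = Counter(p.index(max(p)) for p in kept)
    return votes.most_common(1)[0][0]


# Toy example: 4 views, 3 classes.
toy_logits = [
    [5.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 3.5, 0.0],
    [0.1, 0.0, 0.2],
]
predicted_class = zero_style_predict(toy_logits, keep_fraction=0.75)
```

No gradients are computed anywhere in the sketch, which matches the claim that the method needs no backpropagation. Ties in the vote are resolved in favor of the class seen first among the retained views, that is, the class of the most confident view; this tie-breaking rule is a choice of the sketch.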
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Softmax
