Ultra-Light Test-Time Adaptation for Vision-Language Models
Byunghyun Kim

TL;DR
UL-TTA is a training-free, lightweight test-time adaptation method for vision-language models that improves accuracy and calibration under domain shift by adapting only logit-level parameters with Bayesian updates.
Contribution
The paper introduces UL-TTA, a fully training-free, backprop-free framework that freezes the backbone and adapts only logit-level parameters (class prototypes, class priors, and temperature) using Bayesian updates, making it suitable for streaming and edge scenarios.
Findings
Consistently improves accuracy on large-scale benchmarks.
Reduces calibration error significantly.
Operates with less than 8% latency overhead.
Abstract
Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot recognition by comparing image embeddings to text-derived class prototypes. However, under domain shift, they suffer from feature drift, class-prior mismatch, and severe miscalibration. Existing test-time adaptation (TTA) methods often require backpropagation through large backbones, covariance estimation, or heavy memory/state, which is problematic for streaming and edge scenarios. We propose Ultra-Light Test-Time Adaptation (UL-TTA), a fully training-free and backprop-free framework that freezes the backbone and adapts only logit-level parameters: class prototypes, class priors, and temperature. UL-TTA performs an online EM-style procedure with (i) selective sample filtering to use only confident predictions, (ii) closed-form Bayesian updates for prototypes and priors anchored by text and Dirichlet priors, (iii)…
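The abstract above is cut off, so the exact update rules are not available here. As an illustration only, the following is a minimal NumPy sketch of what logit-level adaptation with a frozen backbone could look like: a confidence filter, a prototype estimate anchored to the text embeddings by a pseudo-count, and class priors tracked as Dirichlet pseudo-counts. The class name `LogitLevelAdapter`, the parameters `kappa`, `alpha`, and `conf_threshold`, and the specific update formulas are assumptions made for this sketch, not the paper's algorithm. Temperature is held fixed because its adaptation rule is not stated in the visible text.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def l2norm(x):
    # Project rows onto the unit sphere, as is usual for CLIP-style embeddings.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class LogitLevelAdapter:
    """Hypothetical sketch: adapts prototypes and priors only, no backprop."""

    def __init__(self, text_protos, kappa=10.0, alpha=1.0,
                 temperature=0.01, conf_threshold=0.6):
        self.text = l2norm(np.asarray(text_protos, dtype=float))  # (C, D)
        self.kappa = kappa                        # assumed text-anchor strength
        self.feat_sum = np.zeros_like(self.text)  # soft-assigned feature sums
        self.counts = np.full(self.text.shape[0], float(alpha))  # Dirichlet pseudo-counts
        self.temperature = temperature            # kept fixed in this sketch
        self.conf_threshold = conf_threshold      # assumed confidence cut-off

    @property
    def prototypes(self):
        # Text prototype acts as kappa pseudo-observations (assumed form).
        return l2norm(self.kappa * self.text + self.feat_sum)

    @property
    def priors(self):
        # Posterior mean of a Dirichlet over class frequencies.
        return self.counts / self.counts.sum()

    def predict(self, feats):
        feats = l2norm(np.asarray(feats, dtype=float))
        logits = feats @ self.prototypes.T / self.temperature
        return softmax(logits + np.log(self.priors))

    def update(self, feats):
        # E-step: soft assignments under the current parameters.
        feats = l2norm(np.asarray(feats, dtype=float))
        probs = self.predict(feats)
        # Selective filtering: only confident samples contribute.
        keep = probs.max(axis=1) >= self.conf_threshold
        if keep.any():
            resp = probs[keep]                       # (n_kept, C)
            # M-step: closed-form accumulation, no gradients involved.
            self.feat_sum += resp.T @ feats[keep]
            self.counts += resp.sum(axis=0)
        return probs
```

In a streaming setting, one would call `update` on each incoming batch of frozen-backbone image embeddings and read predictions from its return value. The per-batch cost in this sketch is a couple of small matrix products, which matches the spirit of the low overhead claimed in the findings, though the sketch itself was not benchmarked.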
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
