A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-Time Adaptation for Vision-Language Models
Mario Döbler, Robert A. Marsden, Tobias Raichle, Bin Yang

TL;DR
This paper systematically evaluates online test-time adaptation techniques for vision-language models such as CLIP, alongside prompt and ensemble strategies, and highlights both their potential and their limitations for improving robustness under distribution shift.
Contribution
It provides a comprehensive comparison of prompt-based and test-time adaptation methods for vision-language models, introducing a vision-text ensemble approach to enhance robustness.
Findings
Test-time adaptation improves model robustness under distribution shifts.
Ensemble strategies outperform single prompt methods.
Existing adaptation methods have limitations in real-world scenarios.
Abstract
In deep learning, maintaining model robustness against distribution shifts is critical. This work explores a broad range of possibilities to adapt vision-language foundation models at test time, with a particular emphasis on CLIP and its variants. The study systematically examines prompt-based techniques and existing test-time adaptation methods, aiming to improve robustness under distribution shift in diverse real-world scenarios. Specifically, the investigation covers various prompt engineering strategies, including handcrafted prompts, prompt ensembles, and prompt learning techniques. Additionally, we introduce a vision-text-space ensemble that substantially enhances average performance compared to text-space-only ensembles. Since online test-time adaptation has been shown to be effective at mitigating performance drops under distribution shift, the study extends its scope to evaluate…
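To make the idea of a text-space prompt ensemble concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy three-dimensional vectors stand in for embeddings that would normally come from a CLIP text or image encoder, and the helper names (`l2_normalize`, `ensemble_class_embedding`, `zero_shot_predict`) are hypothetical. The sketch averages the normalized embeddings of several prompts per class and classifies an image by cosine similarity; it does not cover the vision-text-space ensemble, whose details are not given in this summary.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ensemble_class_embedding(prompt_embeddings):
    """Average the L2-normalized embeddings of several prompts for one class,
    then renormalize so the result can be used for cosine similarity."""
    normalized = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def zero_shot_predict(image_embedding, class_embeddings):
    """Return (index of the most similar class, list of cosine similarities)."""
    img = l2_normalize(image_embedding)
    sims = [sum(a * b for a, b in zip(img, c)) for c in class_embeddings]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


# Toy stand-in embeddings (assumed values). In practice each inner list would be
# the text-encoder output for one prompt template, e.g. "a photo of a {class}".
toy_text_embeddings = {
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2]],
    "dog": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}

class_names = list(toy_text_embeddings)
class_embeddings = [ensemble_class_embedding(toy_text_embeddings[n]) for n in class_names]

# Toy stand-in for an image-encoder output (assumed values).
pred, sims = zero_shot_predict([0.8, 0.2, 0.1], class_embeddings)
print(class_names[pred], [round(s, 3) for s in sims])
```

Averaging after normalization gives every prompt template equal weight regardless of its raw embedding norm; this is a common design choice for prompt ensembles, assumed here rather than taken from the paper.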
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Focus · Contrastive Language-Image Pre-training
