CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation
Marc Lafon, Gustavo Adolfo Vargas Hakim, Clément Rambour, Christian Desrosiers, Nicolas Thome

TL;DR
CLIPTTA introduces a contrastive test-time adaptation method for vision-language models like CLIP, improving robustness under distribution shifts by aligning adaptation with the original training objective and extending to open-set scenarios.
Contribution
The paper proposes CLIPTTA, a gradient-based TTA method that aligns with CLIP's contrastive training, and introduces OCE loss for open-set adaptation, addressing failure modes of previous methods.
Findings
Outperforms entropy-based TTA methods across 75 datasets.
Provides theoretical analysis of gradient stability and collapse mitigation.
Enhances OOD detection with Outlier Contrastive Exposure (OCE) loss.
Abstract
Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. However, this objective is fundamentally misaligned with the contrastive image-text training of VLMs, limiting adaptation performance and introducing failure modes such as pseudo-label drift and class collapse. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP's pre-training objective. We provide a theoretical analysis of CLIPTTA's gradients, showing how its batch-aware design mitigates the risk of collapse. We further extend CLIPTTA to the open-set setting, where both in-distribution (ID) and out-of-distribution (OOD) samples are encountered,…
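To make the contrast between the two objectives concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the temperature value, and the exact form of `soft_contrastive_loss` are assumptions made for illustration. `entropy_loss` is the standard per-image entropy objective, computed independently for each sample. `soft_contrastive_loss` is one plausible batch-level alternative: each image is paired with the text embedding of its predicted class (a pseudo-caption), and the loss is built from image-to-text and text-to-image softmaxes over the whole batch.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x):
    # CLIP compares embeddings by cosine similarity, so normalize rows.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def entropy_loss(img_emb, txt_emb, temperature=0.01):
    """Mean Shannon entropy of per-image class predictions.

    img_emb: (N, D) image embeddings; txt_emb: (C, D) class-prompt embeddings.
    Each image is treated independently of the rest of the batch.
    """
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    p = softmax(logits, axis=1)  # (N, C)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())


def soft_contrastive_loss(img_emb, txt_emb, temperature=0.01):
    """Illustrative batch-aware loss (an assumption, not the paper's formula).

    Each image is paired with the text embedding of its predicted class,
    then image-to-text and text-to-image distributions are formed over the
    batch, as in a CLIP-style contrastive objective.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    pseudo = (img @ txt.T).argmax(axis=1)   # (N,) predicted class per image
    captions = txt[pseudo]                  # (N, D) pseudo-caption embeddings
    sim = img @ captions.T / temperature    # (N, N) batch similarity matrix
    p_i2t = softmax(sim, axis=1)            # each image over batch captions
    p_t2i = softmax(sim, axis=0)            # each caption over batch images
    i2t = -(p_i2t * np.log(p_i2t + 1e-12)).sum(axis=1).mean()
    t2i = -(p_t2i * np.log(p_t2i + 1e-12)).sum(axis=0).mean()
    return float(i2t + t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))   # hypothetical batch of 8 embeddings
    prompts = rng.normal(size=(5, 16))  # hypothetical 5 class prompts
    print("entropy:", entropy_loss(images, prompts))
    print("soft contrastive:", soft_contrastive_loss(images, prompts))
```

In this sketch the second loss depends on every other sample in the batch through the N x N similarity matrix, which is the kind of batch awareness the abstract credits with mitigating collapse. In an actual TTA loop these quantities would be computed in an autodiff framework and backpropagated into the adapted parameters.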
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
