CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation

Marc Lafon, Gustavo Adolfo Vargas Hakim, Clément Rambour, Christian Desrosier, Nicolas Thome

arXiv:2507.14312 · cs.CV · September 22, 2025

TL;DR

CLIPTTA introduces a contrastive test-time adaptation method for vision-language models like CLIP, improving robustness under distribution shifts by aligning adaptation with the original training objective and extending to open-set scenarios.

Contribution

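The paper proposes CLIPTTA, a gradient-based test-time adaptation (TTA) method whose soft contrastive loss is aligned with CLIP's contrastive pre-training objective, and introduces the Outlier Contrastive Exposure (OCE) loss for open-set adaptation. Together they target failure modes of earlier entropy-based methods, such as pseudo-label drift and class collapse.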

Findings

1. Outperforms entropy-based TTA methods across 75 datasets.

2. Provides theoretical analysis of gradient stability and collapse mitigation.

3. Enhances OOD detection with Outlier Contrastive Exposure (OCE) loss.

Abstract

Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. However, this objective is fundamentally misaligned with the contrastive image-text training of VLMs, limiting adaptation performance and introducing failure modes such as pseudo-label drift and class collapse. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP's pre-training objective. We provide a theoretical analysis of CLIPTTA's gradients, showing how its batch-aware design mitigates the risk of collapse. We further extend CLIPTTA to the open-set setting, where both in-distribution (ID) and out-of-distribution (OOD) samples are encountered,…
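To make the contrast between the two objectives concrete, the sketch below compares per-sample entropy minimization with a batch-level image-text contrastive loss built from pseudo-labels. It is a minimal illustration under stated assumptions, not the paper's formulation: the function names, the temperature value, the use of hard argmax pseudo-labels, and the uniform soft targets over images sharing a pseudo-label are all hypothetical choices made here for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def entropy_loss(logits):
    """Mean Shannon entropy of per-image class predictions.

    Each image is treated independently: nothing in this loss looks at
    the rest of the batch.
    """
    log_p = log_softmax(logits, axis=1)
    return float(-(np.exp(log_p) * log_p).sum(axis=1).mean())


def batch_contrastive_loss(image_emb, class_text_emb, temperature=0.01):
    """Illustrative batch-level image-text contrastive loss (an assumption,
    not the exact CLIPTTA loss).

    Each image is paired with the text embedding of its predicted class
    (its "pseudo-caption"), and a symmetric cross-entropy is computed over
    the B x B image/pseudo-caption similarity matrix.
    """
    img = l2_normalize(image_emb)                # (B, D)
    txt = l2_normalize(class_text_emb)           # (C, D)
    pseudo = (img @ txt.T).argmax(axis=1)        # (B,) hard pseudo-labels
    sim = img @ txt[pseudo].T / temperature      # (B, B) similarities

    # Images that share a pseudo-label share a caption, so they are all
    # treated as positives with uniform target mass.
    same = (pseudo[:, None] == pseudo[None, :]).astype(float)
    targets = same / same.sum(axis=1, keepdims=True)

    image_to_text = -(targets * log_softmax(sim, axis=1)).sum(axis=1).mean()
    text_to_image = -(targets * log_softmax(sim.T, axis=1)).sum(axis=1).mean()
    return float(0.5 * (image_to_text + text_to_image))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))   # stand-in for image embeddings
    classes = rng.normal(size=(5, 16))  # stand-in for class-prompt embeddings
    logits = l2_normalize(images) @ l2_normalize(classes).T / 0.01
    print("entropy loss:    ", entropy_loss(logits))
    print("contrastive loss:", batch_contrastive_loss(images, classes))
```

The design point this is meant to show is structural: the entropy term depends only on one image's own prediction, whereas the contrastive term is normalized over the whole batch, so every sample's gradient depends on the other samples it is compared against.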

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications