Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions

Wenxuan Bao; Ruxi Deng; Jingrui He

arXiv:2510.22127·cs.CV·October 28, 2025

TL;DR

This paper introduces Mint, a simple test-time adaptation method that makes vision-language models such as CLIP more robust to input corruptions by maximizing the inter-class variance of their image embeddings.

Contribution

The paper identifies embedding variance collapse in CLIP, where both intra-class and inter-class variances of the image embeddings shrink as corruption severity increases, and proposes Mint, a test-time adaptation technique based on maximizing inter-class variance.

Findings

01. Mint improves robustness across corruption benchmarks.

02. Maximizing inter-class variance enhances embedding discriminability.

03. Theoretical analysis links variance collapse to performance degradation.

Abstract

Pretrained vision-language models such as CLIP achieve strong zero-shot generalization but remain vulnerable to distribution shifts caused by input corruptions. In this work, we investigate how corruptions affect CLIP's image embeddings and uncover a consistent phenomenon we term embedding variance collapse, in which both intra-class and inter-class variances shrink as corruption severity increases. We find that this collapse is closely tied to performance degradation, with inter-class variance strongly correlated with classification accuracy. To explain this phenomenon, we analyze how corruptions alter the structure of the embedding space. Our theoretical results suggest that the visual encoder tends to encode corruption-related signals, which dilute class-discriminative features and compress the representation geometry. We further show that maximizing inter-class variance, even when…
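To make the two quantities in the abstract concrete, the sketch below computes intra-class and inter-class variance for a batch of embeddings. It is an illustration, not the paper's code. The abstract does not give exact formulas, so the definitions here are assumed to be the standard ones: mean squared distance to the class mean, and size-weighted squared distance of class means to the global mean. The names `class_variances` and `toy_corrupt` are hypothetical, and the corruption model (mixing every embedding with one shared direction and renormalizing) is a toy assumption used only to mimic a corruption-related signal diluting class features.

```python
import numpy as np


def class_variances(embeddings, labels):
    """Return (intra, inter) class variance of row-vector embeddings.

    Assumed definitions (not taken from the paper):
      intra = mean squared distance of each embedding to its class mean
      inter = size-weighted mean squared distance of class means to the global mean
    With these definitions, intra + inter equals the total variance.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    global_mean = embeddings.mean(axis=0)
    n = len(embeddings)
    intra = 0.0
    inter = 0.0
    for c in np.unique(labels):
        members = embeddings[labels == c]
        class_mean = members.mean(axis=0)
        intra += np.sum((members - class_mean) ** 2)
        inter += len(members) * np.sum((class_mean - global_mean) ** 2)
    return intra / n, inter / n


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def toy_corrupt(embeddings, direction, strength):
    """Toy corruption (an assumption): blend in one shared direction, then renormalize."""
    mixed = (1.0 - strength) * embeddings + strength * direction
    return l2_normalize(mixed)


# Synthetic unit-norm "embeddings": 5 classes, 40 samples each, 64 dimensions.
rng = np.random.default_rng(0)
num_classes, per_class, dim = 5, 40, 64
labels = np.repeat(np.arange(num_classes), per_class)
centers = rng.normal(size=(num_classes, dim))
clean = l2_normalize(centers[labels] + 0.5 * rng.normal(size=(len(labels), dim)))
direction = l2_normalize(rng.normal(size=(1, dim)))[0]

for strength in (0.0, 0.4, 0.8):
    intra, inter = class_variances(toy_corrupt(clean, direction, strength), labels)
    print(f"strength={strength:.1f} intra={intra:.4f} inter={inter:.4f}")
```

In this toy setting both variances fall as `strength` grows, because every embedding is pulled toward the same shared direction. With real CLIP embeddings the true labels are unavailable at test time, so computing these quantities would require labels from some other source; the sketch sidesteps that by using synthetic ground-truth labels.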
