TL;DR
This paper introduces Mint, a simple test-time adaptation method that makes vision-language models such as CLIP more robust to input corruptions by maximizing the inter-class variance of their image embeddings.
Contribution
The paper uncovers embedding variance collapse in CLIP under corruptions and proposes Mint, a novel, effective test-time adaptation technique based on variance maximization.
Findings
Mint improves robustness across corruption benchmarks.
Maximizing inter-class variance enhances embedding discriminability.
Theoretical analysis links variance collapse to performance degradation.
Abstract
Pretrained vision-language models such as CLIP achieve strong zero-shot generalization but remain vulnerable to distribution shifts caused by input corruptions. In this work, we investigate how corruptions affect CLIP's image embeddings and uncover a consistent phenomenon we term embedding variance collapse, where both intra-class and inter-class variances shrink as corruption severity increases. We find that this collapse is closely tied to performance degradation, with inter-class variance strongly correlated with classification accuracy. To explain this phenomenon, we analyze how corruptions alter the structure of the embedding space. Our theoretical results suggest that the visual encoder tends to encode corruption-related signals, which dilute class-discriminative features and compress the representation geometry. We further show that maximizing inter-class variance, even when…
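To make the two quantities in the abstract concrete, here is a minimal NumPy sketch of how intra-class and inter-class variance of a set of embeddings could be measured. The definitions are an assumption (mean squared distance to the class mean, and mean squared distance of class means to their centre); the paper may define them differently. The helper `mix_with_shared_signal` is hypothetical: it is a toy stand-in for corruption that blends every embedding with one shared vector, and it is not the paper's analysis or the Mint method.

```python
import numpy as np


def class_variances(embeddings, labels):
    """Return (intra_class, inter_class) variance of a set of embeddings.

    Assumed definitions (not taken from the paper):
      intra = mean squared distance of each embedding to its class mean
      inter = mean squared distance of each class mean to the mean of class means
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    classes, class_index = np.unique(np.asarray(labels), return_inverse=True)

    # One mean vector per class, shape (num_classes, dim).
    class_means = np.stack(
        [embeddings[class_index == k].mean(axis=0) for k in range(len(classes))]
    )
    center = class_means.mean(axis=0)

    # Distance of every sample to the mean of its own class.
    residuals = embeddings - class_means[class_index]

    intra = float((residuals ** 2).sum(axis=1).mean())
    inter = float(((class_means - center) ** 2).sum(axis=1).mean())
    return intra, inter


def mix_with_shared_signal(embeddings, shared, strength):
    """Hypothetical toy corruption: blend every embedding with one shared vector."""
    return (1.0 - strength) * np.asarray(embeddings) + strength * np.asarray(shared)


# Synthetic data: 3 classes, 50 samples each, 8-dimensional embeddings.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
centers = rng.normal(size=(3, 8))
clean = centers[labels] + 0.1 * rng.normal(size=(150, 8))
shared = rng.normal(size=8)

for strength in (0.0, 0.5, 0.9):
    mixed = mix_with_shared_signal(clean, shared, strength)
    intra, inter = class_variances(mixed, labels)
    print(f"strength={strength:.1f}  intra={intra:.4f}  inter={inter:.4f}")
```

In this toy setup both variances shrink by a factor of (1 - strength)^2 by construction, so it only illustrates what "collapse" means geometrically; it is not evidence about how CLIP behaves under real corruptions.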