Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model
Huan Ma, Yan Zhu, Changqing Zhang, Peilin Zhao, Baoyuan Wu, Long-Kai Huang, Qinghua Hu, Bingzhe Wu

TL;DR
This paper introduces Spurious Feature Eraser, a test-time prompt tuning method that enhances vision-language models' robustness by removing spurious features, thereby improving their generalization on downstream tasks.
Contribution
The paper proposes a novel test-time prompt tuning approach to erase spurious features, improving the generalization of vision-language models like CLIP on downstream tasks.
Findings
Significant performance improvements over existing methods.
Effective suppression of decision shortcuts during inference.
Enhanced reliance on invariant causal features.
Abstract
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data. However, these models also display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of "decision shortcuts" that hinder their generalization capabilities. In this work, we find that the CLIP model possesses a rich set of features, encompassing both desired invariant causal features and undesired decision shortcuts. Moreover, the underperformance of CLIP on downstream tasks originates from its inability to effectively utilize pre-trained features in accordance with specific task requirements. To address this challenge, we propose a simple yet effective method, Spurious Feature Eraser (SEraser), to alleviate the decision shortcuts by…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
