Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations
Haocheng Dai, Sarang Joshi

TL;DR
This paper investigates how visual representations in vision-language contrastive models like CLIP can be used to reduce biases inherited from training data, improving their robustness in downstream tasks.
Contribution
It demonstrates that visual features from CLIP are more effective than text embeddings in refining model perceptions and mitigating biases.
Findings
Visual representations outperform text embeddings in bias reduction.
A simple linear probe can extract core features for downstream tasks.
Using visual features enhances model robustness against dataset biases.
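A linear probe of the kind mentioned in the findings can be sketched as a softmax classifier trained on frozen, L2-normalized embeddings. The snippet below is only an illustration, not the authors' implementation: the random `features` array is a synthetic stand-in for precomputed CLIP image embeddings, and the names `fit_linear_probe` and `predict`, as well as all hyperparameters, are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # CLIP embeddings are usually compared on the unit sphere.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def fit_linear_probe(features, labels, num_classes, lr=0.5, epochs=300, weight_decay=1e-4):
    # Softmax regression on frozen features, trained with full-batch gradient descent.
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad + weight_decay * W)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return np.argmax(features @ W + b, axis=1)

# ASSUMPTION: synthetic Gaussian clusters stand in for frozen CLIP image embeddings.
rng = np.random.default_rng(0)
num_classes, dim, n_per_class = 3, 64, 200
class_means = rng.normal(size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), n_per_class)
features = l2_normalize(class_means[labels] + 0.5 * rng.normal(size=(labels.size, dim)))

# Hold out a quarter of the samples to check the probe on unseen data.
perm = rng.permutation(labels.size)
train_idx, test_idx = perm[:450], perm[450:]
W, b = fit_linear_probe(features[train_idx], labels[train_idx], num_classes)
test_acc = float(np.mean(predict(features[test_idx], W, b) == labels[test_idx]))
print(f"held-out accuracy: {test_acc:.3f}")
```

In practice the synthetic array would be replaced by embeddings extracted once from a frozen image encoder, so only the small matrix `W` and bias `b` are trained.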
Abstract
Large vision-language contrastive models (VLCMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLCM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations inherited from the biased pre-training dataset. Empirical evidence suggests that…
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
