Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations

Haocheng Dai; Sarang Joshi

arXiv:2405.14030 · cs.CV · July 10, 2025

TL;DR

This paper investigates how visual representations in vision-language contrastive models like CLIP can be used to reduce biases inherited from training data, improving their robustness in downstream tasks.

Contribution

The paper demonstrates that CLIP's visual features are more effective than its text embeddings for refining model perceptions and mitigating biases.

Findings

01. Visual representations outperform text embeddings in bias reduction.

02. A simple linear probe can extract core features for downstream tasks.

03. Using visual features enhances model robustness against dataset biases.

Abstract

Large vision-language contrastive models (VLCMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, like other foundation models, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLCM performance in scenarios where these contextual elements are absent. This study investigates how a simple linear probe can effectively distill task-specific core features from CLIP's embeddings for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations inherited from the biased pre-training dataset. Empirical evidence suggests that…
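
To illustrate what a linear probe on frozen embeddings looks like in general, here is a minimal sketch. It is not the authors' implementation. The embeddings are synthetic stand-ins for CLIP image features, and the function names (make_synthetic_embeddings, train_linear_probe, probe_accuracy), the dimensions, and the training settings are all assumptions chosen for illustration. The probe is plain logistic regression fitted by gradient descent on fixed feature vectors.

```python
import numpy as np

def make_synthetic_embeddings(n=400, dim=32, seed=0):
    """Hypothetical stand-in for frozen CLIP image embeddings."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    feats = rng.normal(size=(n, dim))
    # Assumption: dimension 0 carries the task-relevant ("core") signal.
    feats[:, 0] += 3.0 * (2 * labels - 1)
    # L2-normalize each vector, as is commonly done with CLIP embeddings.
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    return feats, labels

def train_linear_probe(feats, labels, lr=1.0, steps=500):
    """Fit a logistic-regression probe with plain gradient descent.

    Only the weights w and bias b are learned; the embeddings stay fixed.
    """
    n, dim = feats.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(steps):
        logits = feats @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels          # gradient of the log loss w.r.t. logits
        w -= lr * feats.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, feats, labels):
    """Fraction of examples the probe classifies correctly."""
    preds = (feats @ w + b > 0).astype(int)
    return float((preds == labels).mean())

feats, labels = make_synthetic_embeddings()
w, b = train_linear_probe(feats[:300], labels[:300])   # train split
acc = probe_accuracy(w, b, feats[300:], labels[300:])  # held-out split
print(f"held-out accuracy: {acc:.2f}")
```

In a real setting the synthetic features would be replaced by embeddings computed once from a frozen encoder. Because only a single weight vector is trained, the probe is cheap to fit and its weights can be inspected directly.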

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training