Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models
Zahraa Al Sahili, Ioannis Patras, Matthew Purver

TL;DR
This paper investigates how model size, training data scale, and data source influence social biases in vision-language models, revealing data source as the primary bias driver and evaluating debiasing methods.
Contribution
It systematically compares CLIP and OpenCLIP, which share an identical contrastive objective but differ in encoder size and in the source and scale of their pre-training data. It identifies data source as the key factor shaping both bias and debiasing effectiveness.
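For readers unfamiliar with the shared objective: CLIP and OpenCLIP are both trained with a symmetric contrastive (InfoNCE) loss over batches of matched image-text pairs. The pure-Python sketch below illustrates that loss under simplifying assumptions. It uses a fixed temperature of 0.07 (CLIP instead learns this scale during training, initialized at 0.07), toy two-dimensional embeddings, and hypothetical helper names. It is not either project's training code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of matched pairs.

    (image_embs[i], text_embs[i]) is the positive pair for index i; every
    other image-text combination in the batch acts as a negative.
    Hypothetical helper with a fixed (not learned) temperature.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(images[i], texts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: cross-entropy of each row against its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation over columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

With correctly paired toy embeddings the loss is close to zero; swapping the captions drives it up to about 1/0.07. Because the objective rewards whatever image-text associations the corpus contains, the choice of pre-training data directly shapes what the model learns.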
Findings
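Increasing encoder size reduces gender bias in CLIP but amplifies both gender and racial bias in OpenCLIP.
Expanding the LAION training corpus from 400M to 2B pairs further increases racial bias in OpenCLIP.
Data source is the main driver of bias patterns and also affects debiasing success: at matched model and data budgets, replacing proprietary data with LAION improves gender fairness but increases racial skew.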
Abstract
Vision-language models (VLMs) deliver strong zero-shot recognition but frequently inherit social biases from their training data. We systematically disentangle three design factors (model size, training-data scale, and training-data source) by comparing CLIP and OpenCLIP, two models that share an identical contrastive objective yet differ in encoder width and in the image-text corpora on which they are pre-trained (400M proprietary pairs vs. 400M or 2B LAION pairs). Across balanced face-analysis benchmarks, enlarging the encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP; increasing the LAION corpus from 400M to 2B pairs further increases OpenCLIP bias. At matched model and data budgets, substituting proprietary data with LAION improves gender fairness while increasing racial skew, underscoring data source as the primary driver of bias patterns. We also…
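The abstract reports "skew" on balanced benchmarks but does not define the measure here. As an illustration only, the sketch below assumes a MaxSkew-style score. For one concept (for example, a text prompt a zero-shot classifier can assign), it compares each demographic group's share of the images assigned that concept with the group's share of the whole benchmark, which is uniform when the benchmark is balanced. The helper name `max_skew` and the toy labels are hypothetical, not the authors' metric or data.

```python
import math
from collections import Counter


def max_skew(predictions, groups, concept):
    """MaxSkew-style bias score for one concept (hypothetical helper).

    predictions[i] is the concept a zero-shot classifier assigned to image i;
    groups[i] is that image's demographic label. Returns the maximum over
    groups of ln(observed share / desired share), where the desired share is
    the group's share of the whole dataset. 0.0 means proportional assignment.
    """
    dataset_counts = Counter(groups)
    total = len(groups)
    # Demographic labels of the images that received this concept.
    assigned = [g for p, g in zip(predictions, groups) if p == concept]
    if not assigned:
        raise ValueError(f"no images were assigned to {concept!r}")
    assigned_counts = Counter(assigned)
    skews = []
    for group, count in dataset_counts.items():
        desired = count / total
        observed = assigned_counts[group] / len(assigned)
        # A group that never receives the concept is maximally under-represented.
        skews.append(math.log(observed / desired) if observed > 0 else float("-inf"))
    return max(skews)


groups = ["f", "f", "m", "m"]                        # toy labels, balanced
predictions = ["nurse", "nurse", "nurse", "doctor"]  # toy zero-shot outputs
print(round(max_skew(predictions, groups, "nurse"), 3))  # 0.288
```

Here "nurse" goes to group `f` two times out of three against an expected one in two, giving ln(4/3) ≈ 0.288. A concept assigned in proportion to the groups scores exactly 0.0. Taking the maximum over groups reports the most over-represented group for each concept.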
Taxonomy
Topics: linguistics and terminology studies · Text Readability and Simplification
Methods: Diffusion · Contrastive Language-Image Pre-training
