Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI
Robert Wolfe, Aayushi Dangol, Alexis Hiniker, Bill Howe

TL;DR
This study investigates how dataset size and societal consistency influence facial impression biases in CLIP vision-language models and how those biases carry over to generative AI, finding that models trained on larger, uncurated datasets reproduce more human-like and subtle biases.
Contribution
It demonstrates that dataset scale and societal consensus shape the emergence of facial impression biases in CLIP models and their transfer to generative models like Stable Diffusion.
Findings
Biases reflect societal consensus across models.
Larger datasets produce more human-like biases.
Biases intersect with racial biases in generative models.
Abstract
Multimodal AI models capable of associating images and text hold promise for numerous domains, ranging from automated image captioning to accessibility applications for blind and low-vision users. However, uncertainty about bias has in some cases limited their adoption and availability. In the present work, we study 43 CLIP vision-language models to determine whether they learn human-like facial impression biases, and we find evidence that such biases are reflected across three distinct CLIP model families. We show for the first time that the degree to which a bias is shared across a society predicts the degree to which it is reflected in a CLIP model. Human-like impressions of visually unobservable attributes, like trustworthiness and sexuality, emerge only in models trained on the largest dataset, indicating that a better fit to uncurated cultural data results in the reproduction…
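The abstract does not spell out how agreement between a CLIP model and human raters is measured. As a rough illustration only, and not the authors' procedure, one way to probe it is to score each face image by its cosine similarity to a positive attribute prompt minus its similarity to a negative one, and then rank-correlate those scores with human ratings. Everything below is an assumption made for the sketch: the toy 2-D vectors stand in for real CLIP embeddings, and the names `association`, `spearman`, and `human_ratings` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(image_emb, pos_text_emb, neg_text_emb):
    """Similarity to a positive prompt minus similarity to a negative prompt."""
    return cosine(image_emb, pos_text_emb) - cosine(image_emb, neg_text_emb)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy stand-ins (assumed, not real data): embeddings for two prompts such as
# "a trustworthy person" / "an untrustworthy person" and for four face images.
pos_prompt = [1.0, 0.0]
neg_prompt = [0.0, 1.0]
image_embs = [[1.0, 0.1], [0.8, 0.4], [0.5, 0.5], [0.2, 0.9]]
human_ratings = [6.5, 5.0, 4.1, 2.0]  # hypothetical mean ratings per image

scores = [association(e, pos_prompt, neg_prompt) for e in image_embs]
rho = spearman(scores, human_ratings)
print(f"model-human rank correlation: {rho:.3f}")
```

Under this framing, a model that "reflects" a human bias would show a high rank correlation for that attribute. Repeating the calculation per attribute would then allow a comparison against how strongly human raters agree with one another on each attribute.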
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Face Recognition and Perception
Methods: Contrastive Language-Image Pre-training · Diffusion
