More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models
Messi H.J. Lee, Jacob M. Montgomery, Calvin K. Lai

TL;DR
This paper shows that vision-language models such as GPT-4V stereotype more strongly when shown faces that are more prototypically Black and feminine, pointing to visual cues as a key driver of bias.
Contribution
It shows that VLM stereotyping is driven by visual cues rather than group membership alone, underscoring the need to address visual sources of bias in these models.
Findings
Faces rated as more prototypically Black and feminine lead to greater stereotyping.
VLMs rely on visual cues, not just group labels, to associate stereotypes.
Because stereotyping tracks visual cues as well as group labels, these biases may be more pervasive and harder to mitigate.
Abstract
Vision Language Models (VLMs), exemplified by GPT-4V, adeptly integrate text and vision modalities. This integration enhances Large Language Models' ability to mimic human perception, allowing them to process image inputs. Despite VLMs' advanced capabilities, however, there is a concern that VLMs inherit biases of both modalities in ways that make biases more pervasive and difficult to mitigate. Our study explores how VLMs perpetuate homogeneity bias and trait associations with regard to race and gender. When prompted to write stories based on images of human faces, GPT-4V describes subordinate racial and gender groups with greater homogeneity than dominant groups and relies on distinct, yet generally positive, stereotypes. Importantly, VLM stereotyping is driven by visual cues rather than group membership alone, such that faces that are rated as more prototypically Black and feminine…
Taxonomy
Topics: Media, Religion, Digital Communication
