TL;DR
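This paper introduces GRAS, a benchmark for measuring demographic bias in vision language models (VLMs) across gender, race, age, and skin tone, together with the GRAS Bias Score, an interpretable bias metric. Benchmarking five state-of-the-art VLMs reveals concerning bias levels and shows that VQA-based bias evaluation must consider multiple formulations of a question.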
Contribution
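The paper presents GRAS, a bias benchmark offering the most diverse demographic coverage to date (gender, race, age, and skin tone), and the GRAS Bias Score, an interpretable metric for quantifying bias in VLMs. It also benchmarks five state-of-the-art VLMs.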
Findings
All five benchmarked models exhibit concerning bias levels; the least biased model attains a GRAS Bias Score of only 2 out of 100.
Evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question.
Code, data, and results are publicly available.
Abstract
As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.