Vision Language Models are Biased
An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim

TL;DR
This paper shows that vision language models (VLMs) give inaccurate answers on objective visual tasks such as counting and identification, because prior knowledge about popular subjects and contextual visual cues override what is actually in the image. It also proposes a framework for testing these biases.
Contribution
It introduces a human-supervised automated framework for testing biases in VLMs and shows that removing image backgrounds raises accuracy by 21.09 percentage points.
Findings
VLMs averaged only 17.05% accuracy on counting tasks across 7 diverse domains.
Removing image backgrounds nearly doubled accuracy, a gain of 21.09 percentage points.
Counting accuracy peaks at ~40% with moderate reasoning tokens before declining.
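The background-removal gain is reported in percentage points, which is an absolute difference between two accuracies and not a final accuracy value. The sketch below illustrates the distinction. The 21.09-point gain comes from the abstract; the baseline value is a hypothetical placeholder chosen only for illustration, since the abstract excerpt here does not state the baseline for that experiment.

```python
def apply_point_gain(baseline_pct: float, gain_points: float) -> tuple[float, float]:
    """Return (accuracy after the gain, ratio of new to old accuracy).

    A gain in percentage points is added to the baseline accuracy;
    the ratio shows how that same gain reads as a relative change.
    """
    after_pct = baseline_pct + gain_points
    ratio = after_pct / baseline_pct
    return after_pct, ratio


GAIN_POINTS = 21.09            # reported in the abstract
HYPOTHETICAL_BASELINE = 22.0   # assumption for illustration, NOT from the paper

after_pct, ratio = apply_point_gain(HYPOTHETICAL_BASELINE, GAIN_POINTS)
print(f"baseline {HYPOTHETICAL_BASELINE:.2f}% -> after {after_pct:.2f}% "
      f"({ratio:.2f}x the baseline)")
```

Under this assumed baseline, a gain of about 21 points corresponds to a ratio just under 2, which is consistent with the phrase "nearly doubles" applying to the relative change while 21.09 is the absolute gain.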
Abstract
Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a 4th stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains from animals, logos, chess, board games, optical illusions, to patterned grids. Removing image backgrounds nearly doubles accuracy (21.09 percentage points), revealing that contextual visual cues trigger these biased responses. Further analysis of VLMs'…
