Vision Language Models are Biased

An Vo; Khai-Nguyen Nguyen; Mohammad Reza Taesiri; Vy Tuong Dang; Anh Totti Nguyen; Daeyoung Kim

arXiv:2505.23941 · cs.LG · April 21, 2026

TL;DR

This paper shows that vision language models often give wrong answers on objective visual tasks such as counting and identification, because memorized knowledge about popular subjects, triggered by contextual cues in the image, overrides what the image actually shows. It also proposes a framework for testing these biases.

Contribution

It introduces a human-supervised automated framework for testing biases in VLMs and shows that removing image backgrounds raises counting accuracy by 21.09 percentage points.

Findings

01

State-of-the-art VLMs averaged only 17.05% accuracy on counting tasks across 7 diverse domains.

02

Removing image backgrounds nearly doubled counting accuracy, a gain of 21.09 percentage points.

03

Counting accuracy peaks at ~40% with moderate reasoning tokens before declining.

Abstract

Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a 4th stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains from animals, logos, chess, board games, optical illusions, to patterned grids. Removing image backgrounds nearly doubles accuracy (21.09 percentage points), revealing that contextual visual cues trigger these biased responses. Further analysis of VLMs'…
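The scoring behind figures like these can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `CountingItem` record, the `score` and `percentage_point_gain` helpers, and the sample answers are all hypothetical, introduced only to show how counting accuracy, the share of answers that fall back on the memorized count, and a percentage-point gain could be computed.

```python
from dataclasses import dataclass


@dataclass
class CountingItem:
    """One hypothetical test case; all field names are illustrative."""
    true_count: int   # what the modified image actually shows (e.g. 4 stripes)
    prior_count: int  # what the well-known original has (e.g. 3 stripes)
    predicted: int    # the model's answer


def score(items):
    """Return accuracy and the share of wrong answers equal to the memorized count."""
    if not items:
        raise ValueError("need at least one item")
    n = len(items)
    correct = sum(i.predicted == i.true_count for i in items)
    prior_aligned = sum(
        i.predicted == i.prior_count and i.predicted != i.true_count
        for i in items
    )
    return {
        "accuracy_pct": 100.0 * correct / n,
        "prior_aligned_pct": 100.0 * prior_aligned / n,
    }


def percentage_point_gain(before_pct, after_pct):
    """Absolute difference between two accuracies, in percentage points."""
    return after_pct - before_pct


# Made-up answers, for illustration only (not data from the paper).
items = [
    CountingItem(true_count=4, prior_count=3, predicted=3),  # memorized answer
    CountingItem(true_count=4, prior_count=3, predicted=4),  # correct
    CountingItem(true_count=5, prior_count=4, predicted=4),  # memorized answer
    CountingItem(true_count=5, prior_count=4, predicted=6),  # other error
]
result = score(items)
print(result)
```

A gain reported in percentage points is the absolute difference between two accuracies, which is what `percentage_point_gain` returns; it is not a relative change.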

Code & Models

Models

🤗 dennny123/visual-reasoner-8b (model, 103 downloads)

Videos

Vision Language Models are Biased · SlidesLive