Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

Saurav Sengupta; Nazanin Moradinasab; Jiebei Liu; Donald E. Brown

arXiv:2511.17722 · cs.CV · April 6, 2026

TL;DR

This paper introduces a synthetic benchmark and analysis framework for evaluating how vision-language models perform on counting tasks. It finds that accuracy degrades systematically as visual and linguistic complexity increases, and that attention reweighting can yield modest improvements.

Contribution

It develops a controlled diagnostic framework and synthetic dataset to analyze counting behavior and attention mechanisms in vision-language models, highlighting failure modes and potential interventions.

Findings

01 Counting accuracy decreases with increased visual and linguistic complexity.

02 Attention reweighting in the language decoder can modestly improve counting performance.

03 Systematic analysis exposes cross-modal binding failure modes not evident in standard benchmarks.
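
To make the idea of attention reweighting in finding 02 concrete, here is a minimal sketch, assuming the intervention scales the attention a decoder position places on image tokens and then renormalizes. This is an illustration only, not the authors' implementation: the function name `reweight_attention`, the boolean mask `is_image_token`, and the scale factor `alpha` are all hypothetical.

```python
from typing import List, Sequence


def reweight_attention(
    weights: Sequence[float],
    is_image_token: Sequence[bool],
    alpha: float,
) -> List[float]:
    """Scale attention on image tokens by `alpha`, then renormalize.

    `weights` is one row of post-softmax attention (non-negative, sums to 1).
    `is_image_token` marks which key positions are image tokens (assumed mask).
    `alpha` > 1 boosts image tokens; alpha < 1 suppresses them (assumed knob).
    """
    if len(weights) != len(is_image_token):
        raise ValueError("weights and is_image_token must have the same length")
    if alpha < 0:
        raise ValueError("alpha must be non-negative")

    # Multiply only the image-token positions by alpha.
    scaled = [w * alpha if m else w for w, m in zip(weights, is_image_token)]

    total = sum(scaled)
    if total == 0:
        raise ValueError("all attention mass was removed; cannot renormalize")

    # Renormalize so the row is a valid probability distribution again.
    return [s / total for s in scaled]


# Example: two image tokens followed by one text token.
row = [0.25, 0.25, 0.5]
mask = [True, True, False]
boosted = reweight_attention(row, mask, alpha=2.0)
```

In this toy example the image tokens start with half of the attention mass and end with two thirds of it after boosting. In a real model the same operation would be applied per head and per layer, and which layers or heads to modify is a design choice this sketch does not address.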

Abstract

Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require selective visual attention, a demand that mirrors cognitive challenges observed in human enumeration tasks. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically characterize how counting performance varies as image and prompt properties change. Using open-source VLMs, we analyze how performance shifts across controlled perturbations (e.g. number of objects, object color, background color, object texture, background texture, and prompt specificity) and examine corresponding changes in visual attention allocation. We further conduct exploratory attention reweighting…
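
The controlled perturbations listed in the abstract can be pictured as a factorial grid of conditions. The sketch below enumerates such a grid, assuming a full cross of all factors. The factor names follow the abstract, but the specific levels (for example `"striped"` or the counts 1 to 3), the helper names `build_conditions` and `build_prompt`, and the prompt wording are hypothetical and not taken from the paper.

```python
from itertools import product
from typing import Dict, List

# Factor names follow the abstract; the levels are illustrative assumptions.
FACTORS: Dict[str, list] = {
    "num_objects": [1, 2, 3],
    "object_color": ["red", "blue"],
    "background_color": ["white", "black"],
    "object_texture": ["plain", "striped"],
    "background_texture": ["plain", "noisy"],
    "prompt_specificity": ["generic", "specific"],
}


def build_conditions(factors: Dict[str, list]) -> List[dict]:
    """Return one dict per combination of factor levels (full factorial)."""
    names = list(factors)
    return [
        dict(zip(names, values))
        for values in product(*(factors[name] for name in names))
    ]


def build_prompt(condition: dict) -> str:
    """Build a counting question; the wording here is a hypothetical example."""
    if condition["prompt_specificity"] == "specific":
        return f"How many {condition['object_color']} objects are in the image?"
    return "How many objects are in the image?"


conditions = build_conditions(FACTORS)
prompts = [build_prompt(c) for c in conditions]
```

Each condition dict would then drive an image renderer and be paired with its prompt, so that accuracy can be compared across conditions that differ in exactly one factor.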

Code & Models

Repositories

ssen7/vlm-count-analysis (GitHub)
