Counting to Four is still a Chore for VLMs

Duy Le Dinh Anh; Patrick Amadeus Irawan; Tuan Van Vo

arXiv:2604.10039 · cs.CV · April 14, 2026

TL;DR

This paper investigates why vision-language models struggle with simple counting tasks, revealing that visual evidence is underused during reasoning and proposing interventions to improve counting accuracy.

Contribution

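The paper introduces COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases, and analyzes model behavior with attention analysis and component-wise probing, showing how much counting depends on visual evidence surviving into language reasoning.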

Findings

01. Count-relevant visual evidence is strongest in the modality projection stage and degrades in later language layers.

02. Counting failures stem from both limits in visual perception and underuse of visual evidence during language reasoning.

03. Modality Attention Share improves counting performance.
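To make the idea of underused visual evidence concrete, one simple diagnostic is the fraction of a query token's attention mass that falls on image tokens, tracked layer by layer. The sketch below is an illustration only and is not taken from the paper or its repository: the function names, the `is_visual` mask, and the toy attention rows are hypothetical, and the paper's Modality Attention Share may be defined differently.

```python
from typing import List, Sequence


def visual_attention_share(
    attn_row: Sequence[float], is_visual: Sequence[bool]
) -> float:
    """Fraction of one query token's attention mass that lands on visual tokens.

    attn_row  -- attention weights from one query position over all key positions
    is_visual -- True where the key position holds an image token (hypothetical mask)
    """
    if len(attn_row) != len(is_visual):
        raise ValueError("attn_row and is_visual must have the same length")
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    visual = sum(w for w, v in zip(attn_row, is_visual) if v)
    return visual / total


def share_per_layer(
    layers: Sequence[Sequence[float]], is_visual: Sequence[bool]
) -> List[float]:
    """Apply visual_attention_share to one attention row per layer."""
    return [visual_attention_share(row, is_visual) for row in layers]


# Toy data (made up): three image tokens followed by two text tokens.
is_visual = [True, True, True, False, False]
layers = [
    [0.30, 0.30, 0.20, 0.10, 0.10],  # an early layer attending mostly to the image
    [0.05, 0.05, 0.10, 0.40, 0.40],  # a later layer attending mostly to the text
]
shares = share_per_layer(layers, is_visual)
print(shares)
```

In a real model the rows would come from the attention weights of the answer-generating token, averaged over heads, and the mask would mark the positions occupied by projected image tokens. A share that falls across layers would be consistent with, though not proof of, the degradation described in the findings.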

Abstract

Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by…

Code & Models

Repositories

leduy99/-CVPRW26-Modality-Attention-Share (GitHub)