TL;DR
This paper investigates why vision-language models struggle with simple counting tasks, revealing that visual evidence is underused during reasoning and proposing interventions to improve counting accuracy.
Contribution
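The paper introduces COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases, and pairs behavioral evaluation with mechanistic analysis (attention analysis and component-wise probing) to locate where count-relevant visual evidence is lost during language reasoning.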
Findings
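Count-relevant visual evidence is strongest at the modality projection stage and degrades substantially in later language layers.
Counting failures stem from both visual perception limits and underuse of visual evidence during language reasoning, where models become more susceptible to text priors.
The proposed Modality Attention Share intervention improves counting performance.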
Abstract
Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple shape-based counting cases designed to expose vulnerabilities under different patchification layouts and adversarial prompting conditions. Using attention analysis and component-wise probing, we show that count-relevant visual evidence is strongest in the modality projection stage but degrades substantially in later language layers, where models become more susceptible to text priors. Motivated by…
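To make the attention analysis mentioned in the abstract more concrete, here is a minimal sketch of one way to measure how much of a token's attention lands on visual tokens at each layer. It is an illustration under stated assumptions, not the paper's method: the function names `visual_attention_share` and `share_per_layer`, the input layout (one row of attention weights per layer, plus a boolean mask marking visual positions), and the toy numbers are all hypothetical, and the paper's own definition of Modality Attention Share is not given in this summary.

```python
from typing import List, Sequence


def visual_attention_share(attn_row: Sequence[float],
                           is_visual: Sequence[bool]) -> float:
    """Fraction of one query token's attention mass placed on visual tokens.

    Hypothetical helper: attn_row holds non-negative attention weights over
    all context positions, is_visual marks which positions are image tokens.
    """
    if len(attn_row) != len(is_visual):
        raise ValueError("attn_row and is_visual must have the same length")
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    visual = sum(w for w, v in zip(attn_row, is_visual) if v)
    return visual / total


def share_per_layer(attn_by_layer: Sequence[Sequence[float]],
                    is_visual: Sequence[bool]) -> List[float]:
    """Apply visual_attention_share to one attention row per layer."""
    return [visual_attention_share(row, is_visual) for row in attn_by_layer]


# Toy example (made-up weights): positions 0-1 are visual, 2-3 are text.
mask = [True, True, False, False]
layers = [
    [0.5, 0.25, 0.125, 0.125],  # early layer: most mass on visual tokens
    [0.125, 0.125, 0.5, 0.25],  # later layer: most mass on text tokens
]
shares = share_per_layer(layers, mask)
```

In this toy input the share falls from the first row to the second, which is the kind of layer-wise drop the abstract describes for count-relevant visual evidence. Real attention tensors would first need to be reduced over heads and query positions, and that choice is left open here.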
