TL;DR
This paper introduces a synthetic benchmark and analysis framework for evaluating how vision-language models perform on counting tasks. It reveals systematic performance degradation as image and prompt complexity grow, and suggests that attention reweighting may offer modest improvements.
Contribution
The paper develops a controlled diagnostic framework and synthetic dataset for analyzing counting behavior and attention mechanisms in vision-language models, highlighting failure modes and potential interventions.
Findings
Counting accuracy decreases with increased visual and linguistic complexity.
Attention reweighting in the language decoder can modestly improve counting performance.
Systematic analysis exposes cross-modal binding failure modes not evident in standard benchmarks.
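To make the attention reweighting finding concrete, here is a minimal sketch of one generic form such an intervention could take: scaling the attention mass on image-token positions and renormalizing. The summary does not specify the paper's actual procedure, so the function name `reweight_attention`, the scaling factor `alpha`, and the choice to operate on post-softmax weights are all assumptions made for illustration.

```python
def reweight_attention(weights, image_positions, alpha):
    """Scale attention on image-token positions by alpha, then renormalize.

    Hypothetical helper: `weights` is one row of post-softmax attention
    (non-negative, summing to 1), `image_positions` are the indices assumed
    to hold image tokens, and `alpha` > 1 shifts mass toward those tokens.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    image_positions = set(image_positions)
    # Multiply only the image-token entries by alpha.
    scaled = [w * alpha if i in image_positions else w
              for i, w in enumerate(weights)]
    total = sum(scaled)
    if total == 0:
        raise ValueError("attention weights must not all be zero")
    # Renormalize so the row is a probability distribution again.
    return [w / total for w in scaled]


# Toy row of attention over four tokens; positions 1 and 2 are assumed
# to be image tokens.
weights = [0.1, 0.2, 0.3, 0.4]
reweighted = reweight_attention(weights, image_positions=[1, 2], alpha=2.0)
image_mass_before = weights[1] + weights[2]
image_mass_after = reweighted[1] + reweighted[2]
```

In this toy case the share of attention on the assumed image tokens rises after reweighting while the row still sums to 1. Applying such a change inside a real decoder would require hooking into the model's attention layers, which is outside the scope of this sketch.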
Abstract
Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require selective visual attention, a demand that mirrors cognitive challenges observed in human enumeration tasks. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically characterize how counting performance varies as image and prompt properties change. Using open-source VLMs, we analyze how performance shifts across controlled perturbations (e.g. number of objects, object color, background color, object texture, background texture, and prompt specificity) and examine corresponding changes in visual attention allocation. We further conduct exploratory attention reweighting…
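The controlled perturbations listed in the abstract can be pictured as a grid of conditions over the named factors. The sketch below enumerates such a grid and a one-factor sweep around a baseline. The factor names follow the abstract, but every level value (for example `[2, 4, 8]` objects or `"striped"` texture), the `FACTORS` dictionary, and both helper functions are hypothetical placeholders, not the paper's actual design.

```python
from itertools import product

# Factor names come from the abstract; the levels are illustrative assumptions.
FACTORS = {
    "num_objects": [2, 4, 8],
    "object_color": ["red", "blue"],
    "background_color": ["white", "gray"],
    "object_texture": ["solid", "striped"],
    "background_texture": ["plain", "noisy"],
    "prompt_specificity": ["generic", "specific"],
}


def build_conditions(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, values))
            for values in product(*(factors[name] for name in names))]


def one_factor_sweep(factors, baseline, name):
    """Vary a single factor while holding all others at the baseline."""
    return [{**baseline, name: value} for value in factors[name]]


conditions = build_conditions(FACTORS)
# Use the first level of each factor as an (assumed) baseline condition.
baseline = {name: levels[0] for name, levels in FACTORS.items()}
sweep = one_factor_sweep(FACTORS, baseline, "num_objects")
```

A full grid isolates interactions between factors at the cost of many conditions, whereas a one-factor sweep keeps the comparison cheap and attributes any accuracy change to the single factor that moved.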
