CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models
Reem Alzahrani, Hassan Alshanqiti, Bushra Bin Hemid, Zaid Alyafeai, Abdelrahman Eldesokey, Bernard Ghanem

TL;DR
CounterCount is a diagnostic framework that tests whether vision-language models count from visual evidence or fall back on learned priors, exposing their biases and guiding improvements.
Contribution
It introduces a counterfactual counting benchmark with localized evidence annotations, along with an attention modulation method that improves counting accuracy.
Findings
Models perform well on factual images but degrade with counterfactual attribute changes.
Failures stem from underweighted attention on count-relevant visual tokens.
Attention reweighting improves counting accuracy by up to 8%.
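As a rough illustration of what attention reweighting can look like, the sketch below boosts the attention mass on tokens flagged as count-relevant and then renormalizes. It is not the paper's method: the function name `reweight_attention`, the `boost` factor, and the toy weights are all assumptions made up for this example.

```python
def reweight_attention(weights, evidence_mask, boost=2.0):
    """Scale attention on evidence tokens by `boost`, then renormalize.

    weights:       attention weights over visual tokens (non-negative, sum to 1)
    evidence_mask: booleans, True where a token is count-relevant (hypothetical annotation)
    boost:         multiplicative factor applied to evidence tokens (assumed hyperparameter)
    """
    if len(weights) != len(evidence_mask):
        raise ValueError("weights and evidence_mask must have the same length")
    scaled = [w * boost if is_evidence else w
              for w, is_evidence in zip(weights, evidence_mask)]
    total = sum(scaled)
    if total == 0:
        # Nothing to renormalize; return the input unchanged.
        return list(weights)
    return [s / total for s in scaled]


# Toy example: four visual tokens, the middle two are count-relevant.
weights = [0.4, 0.1, 0.1, 0.4]
evidence_mask = [False, True, True, False]

new_weights = reweight_attention(weights, evidence_mask, boost=3.0)

mass_before = sum(w for w, m in zip(weights, evidence_mask) if m)
mass_after = sum(w for w, m in zip(new_weights, evidence_mask) if m)
print(f"evidence attention mass: {mass_before:.3f} -> {mass_after:.3f}")
```

In a real model the reweighting would be applied inside selected attention layers and heads, and which ones to modify is a design choice this sketch does not address.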
Abstract
Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to…
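The paired evaluation described in the abstract can be sketched as follows. This is a minimal sketch under assumed conventions, not the benchmark's actual scoring code: the record fields (`factual_pred`, `factual_gold`, `counterfactual_pred`, `counterfactual_gold`) and the "prior-following" rate are hypothetical names introduced here for illustration.

```python
def paired_counting_report(records):
    """Summarize accuracy on factual vs. counterfactual images.

    Each record is a dict with hypothetical keys:
      factual_pred, factual_gold, counterfactual_pred, counterfactual_gold
    """
    if not records:
        raise ValueError("records must be non-empty")
    n = len(records)
    factual_acc = sum(r["factual_pred"] == r["factual_gold"] for r in records) / n
    counterfactual_acc = sum(
        r["counterfactual_pred"] == r["counterfactual_gold"] for r in records
    ) / n
    # A counterfactual answer "follows the prior" when it repeats the canonical
    # (factual) count even though the edited image shows a different count.
    prior_following = sum(
        r["counterfactual_gold"] != r["factual_gold"]
        and r["counterfactual_pred"] == r["factual_gold"]
        for r in records
    ) / n
    return {
        "factual_acc": factual_acc,
        "counterfactual_acc": counterfactual_acc,
        "gap": factual_acc - counterfactual_acc,
        "prior_following_rate": prior_following,
    }


# Toy data: four image pairs with made-up predictions.
records = [
    {"factual_pred": 4, "factual_gold": 4, "counterfactual_pred": 4, "counterfactual_gold": 3},
    {"factual_pred": 5, "factual_gold": 5, "counterfactual_pred": 6, "counterfactual_gold": 6},
    {"factual_pred": 2, "factual_gold": 2, "counterfactual_pred": 2, "counterfactual_gold": 3},
    {"factual_pred": 8, "factual_gold": 8, "counterfactual_pred": 7, "counterfactual_gold": 7},
]

report = paired_counting_report(records)
print(report)
```

A large gap combined with a high prior-following rate would be consistent with the model answering from the canonical count rather than from the edited image.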
