Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models
Kuinan Hou, Jing Mi, Marco Zorzi, Lamberto Ballan, Alberto Testolin

TL;DR
This paper compares specialized counting architectures with vision-language models (VLMs) on object counting tasks. It finds that VLMs can match or surpass specialized models, especially when prompted to generate intermediate object representations, but that they still struggle in complex scenes.
Contribution
The paper provides a systematic comparison of domain-specific counting models and VLMs, highlighting both the potential and the limitations of VLMs for open-set object counting.
Findings
VLMs can approximately enumerate objects, matching or surpassing specialized models.
Prompting VLMs to generate object locations and labels improves accuracy.
All models struggle with counting in complex visual scenes.
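The second finding, that asking a VLM to list object locations and labels before answering improves accuracy, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the prompt wording, the JSON reply format, and the names `PROMPT_TEMPLATE` and `count_from_response` are all hypothetical. The sketch only shows how a count could be derived from such an intermediate representation once a model has returned it.

```python
import json
from collections import Counter

# Hypothetical prompt (not the paper's wording): ask the model to enumerate
# each object with a label and an approximate location before counting.
PROMPT_TEMPLATE = (
    "List every {category} in the image as a JSON array of objects with "
    '"label", "x" and "y" keys (approximate pixel coordinates). '
    "Output only the JSON array."
)


def count_from_response(response_text: str, category: str) -> int:
    """Count entries whose label matches `category` in a JSON list reply."""
    try:
        items = json.loads(response_text)
    except json.JSONDecodeError:
        # Assumption: an unparseable reply is scored as a count of zero.
        return 0
    if not isinstance(items, list):
        return 0
    # Normalize labels so that "Apple" and "apple" are counted together.
    labels = Counter(
        str(item.get("label", "")).strip().lower()
        for item in items
        if isinstance(item, dict)
    )
    return labels[category.strip().lower()]


if __name__ == "__main__":
    reply = (
        '[{"label": "apple", "x": 10, "y": 20},'
        ' {"label": "Apple", "x": 50, "y": 60},'
        ' {"label": "pear", "x": 5, "y": 5}]'
    )
    print(PROMPT_TEMPLATE.format(category="apple"))
    print(count_from_response(reply, "apple"))
```

Deriving the count from the listed items, rather than asking for a bare number, makes each counted object inspectable. Scoring malformed replies as zero is a simplification; an evaluation might instead prefer to flag them separately.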
Abstract
Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
