Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models

Kuinan Hou; Jing Mi; Marco Zorzi; Lamberto Ballan; Alberto Testolin

arXiv:2512.15254·cs.CV·December 18, 2025

TL;DR

This paper compares specialized counting architectures and vision-language models (VLMs) on object counting tasks. It finds that VLMs can match or surpass specialized models, especially when prompted to generate intermediate object representations, but that they still struggle in complex scenes.

Contribution

It provides a systematic comparison of domain-specific counting models and VLMs, highlighting the potential and limitations of VLMs for open-set object counting tasks.

Findings

01. VLMs can approximately enumerate objects, matching or surpassing specialized models.

02. Prompting VLMs to generate object locations and labels improves accuracy.

03. All models struggle with counting in complex visual scenes.
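
The second finding, that asking a VLM for object locations and labels before a count improves accuracy, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the prompt wording, the JSON reply format, and the helper names `build_prompt` and `count_from_response` are hypothetical, and no real model API is called. The idea shown is only that the count is derived from the listed objects rather than requested as a single number.

```python
import json
import re


def build_prompt(category: str) -> str:
    """Hypothetical prompt asking for an intermediate representation
    (labels and point locations) instead of a bare number."""
    return (
        f"List every {category} in the image as a JSON array. "
        'Each element must look like {"label": "<name>", "point": [x, y]}. '
        "Return only the JSON array."
    )


def count_from_response(response: str, category: str) -> int:
    """Count the listed objects whose label matches `category`.

    Returns 0 when no parseable JSON array is found in the reply.
    """
    # Take the span from the first '[' to the last ']', since a model
    # may wrap the array in extra text.
    match = re.search(r"\[.*\]", response, flags=re.DOTALL)
    if match is None:
        return 0
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return 0
    if not isinstance(items, list):
        return 0
    wanted = category.strip().lower()
    return sum(
        1
        for item in items
        if isinstance(item, dict)
        and str(item.get("label", "")).strip().lower() == wanted
    )


# Made-up reply text, standing in for what a VLM might return.
reply = (
    'Sure: [{"label": "apple", "point": [12, 40]}, '
    '{"label": "apple", "point": [88, 35]}, '
    '{"label": "pear", "point": [50, 70]}]'
)
print(count_from_response(reply, "apple"))  # prints 2
```

Deriving the count from an explicit list also makes each answer inspectable, since every counted item comes with a location that can be checked against the image.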

Abstract

Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications