Visual Enumeration Remains Challenging for Multimodal Generative AI
Alberto Testolin, Kuinan Hou, Marco Zorzi

TL;DR
This paper introduces benchmark tasks for evaluating the enumeration capabilities of multimodal AI models and shows that, unlike humans, these models cannot count objects accurately or systematically.
Contribution
It proposes two benchmark tasks inspired by cognitive science for assessing AI counting skills and provides a comprehensive analysis of current models' deficiencies in visual enumeration.
Findings
AI models perform poorly on counting tasks, especially outside the subitizing range.
Unlike human counting, model errors are often influenced by object category.
Increasing model size does not significantly improve counting accuracy.
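The findings above turn on how accuracy changes with the number of objects shown. A minimal sketch of such a scoring step follows, assuming each trial is recorded as a (target, response) pair of integers. The function names, the trial format, the cutoff of 4 for the subitizing range, and the sample responses are all illustrative assumptions, not the paper's actual evaluation code or data.

```python
from collections import defaultdict

# Assumption: 4 is a commonly cited upper bound of the subitizing range;
# the paper's own cutoff may differ.
SUBITIZING_MAX = 4


def accuracy_by_numerosity(trials):
    """Return exact-match accuracy for each target numerosity.

    trials: iterable of (target, response) integer pairs (hypothetical format).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for target, response in trials:
        totals[target] += 1
        hits[target] += int(response == target)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


def accuracy_in_range(trials, low, high):
    """Return exact-match accuracy over trials with low <= target <= high."""
    selected = [(t, r) for t, r in trials if low <= t <= high]
    if not selected:
        return None  # no trials fall in the requested range
    return sum(t == r for t, r in selected) / len(selected)


# Made-up responses for illustration only, not results from the paper.
trials = [(2, 2), (3, 3), (4, 4), (4, 5), (7, 6), (7, 7), (9, 8), (9, 10)]

per_n = accuracy_by_numerosity(trials)
inside = accuracy_in_range(trials, 1, SUBITIZING_MAX)
outside = accuracy_in_range(trials, SUBITIZING_MAX + 1, 10)

print(per_n)
print("inside subitizing range:", inside)
print("outside subitizing range:", outside)
```

Grouping the same trials by object category instead of by target numerosity would give a comparable view of the category effect noted above.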
Abstract
Many animal species can approximately judge the number of objects in a visual scene at a single glance, and humans can further determine the exact cardinality of a set by deploying systematic counting procedures. In contrast, it has been observed that even state-of-the-art AI systems have very limited enumeration skills. In this work, we propose two benchmark tasks inspired by cognitive science that allow precise evaluation of the visual enumeration capabilities of multimodal foundation models, thereby providing an objective measure of their number sense and counting level. We consider popular visual question answering models (BLIP, LLaVA and ViLT) as well as advanced image-to-text (Gemini, GPT and Qwen) and text-to-image (DALL-E, FLUX and Stable Diffusion) AI systems. Our analyses show that even the most advanced models cannot reliably name the number of objects in simple visual…
Taxonomy
Topics: Advanced Vision and Imaging · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis
