Visual Enumeration Remains Challenging for Multimodal Generative AI

Alberto Testolin; Kuinan Hou; Marco Zorzi

arXiv:2402.03328 · cs.CV · July 29, 2025 · 1 citation

TL;DR

This paper introduces benchmark tasks for evaluating the enumeration capabilities of multimodal AI models and shows that, unlike humans, these models cannot count objects accurately or systematically.

Contribution

It proposes new benchmarks inspired by cognitive science for assessing AI counting skills and provides a comprehensive analysis of current models' deficiencies in visual enumeration.

Findings

01

AI models perform poorly on counting tasks, especially outside the subitizing range.

02

Model errors often depend on object category, which is unlike human behavior.

03

Increasing model size does not significantly improve counting accuracy.
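
To make the first finding concrete, the comparison between small and larger numerosities can be sketched as a per-range accuracy score. This is a minimal illustration, not the authors' evaluation code: the function name `accuracy_by_range`, the cutoff of 4 items for the subitizing range, and the toy responses are all assumptions introduced here.

```python
from collections import defaultdict

# Assumption: the subitizing range is taken to end at 4 items.
# The paper's own cutoff may differ.
SUBITIZING_MAX = 4


def accuracy_by_range(trials, subitizing_max=SUBITIZING_MAX):
    """Return exact-match accuracy inside and beyond the subitizing range.

    trials: iterable of (target_count, model_response) integer pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for target, response in trials:
        bucket = "subitizing" if target <= subitizing_max else "beyond"
        totals[bucket] += 1
        hits[bucket] += int(response == target)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Hypothetical responses, made up for illustration only (not data from the paper).
toy_trials = [(1, 1), (2, 2), (3, 3), (4, 5), (6, 5), (8, 8), (9, 7), (10, 12)]
result = accuracy_by_range(toy_trials)
print(result)
```

Exact-match accuracy is only one possible score. A signed or absolute error per target numerosity would additionally show whether responses drift systematically above or below the true count.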

Abstract

Many animal species can approximately judge the number of objects in a visual scene at a single glance, and humans can further determine the exact cardinality of a set by deploying systematic counting procedures. In contrast, it has been observed that even state-of-the-art AI systems have very limited enumeration skills. In this work, we propose two benchmark tasks inspired by cognitive science that make it possible to precisely evaluate the visual enumeration capabilities of multimodal foundation models, thereby providing an objective measure of their number sense and counting level. We consider popular visual question answering models (BLIP, LLaVA and ViLT) as well as advanced image-to-text (Gemini, GPT and Qwen) and text-to-image (DALL-E, FLUX and Stable Diffusion) AI systems. Our analyses show that even the most advanced models cannot reliably name the number of objects in simple visual…

Code & Models

Repositories

ccnl-unipd/numbersense-ai
PyTorch · Official

Taxonomy

Topics: Advanced Vision and Imaging · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis