CountQA: How Well Do MLLMs Count in the Wild?

Jayant Sravan Tamarapalli; Rynaa Grover; Nilay Pande; Sahiti Yerramilli

arXiv:2508.06585·cs.AI·September 10, 2025

Open Access · 1 dataset

TL;DR

CountQA introduces a challenging benchmark to evaluate and improve the object counting ability of Multimodal Large Language Models in complex, real-world scenarios, revealing significant performance gaps.

Contribution

This paper presents CountQA, the first comprehensive benchmark for assessing MLLMs' object counting in realistic, cluttered images, highlighting their current limitations and guiding future improvements.

Findings

01 Top model achieves 42.9% accuracy on CountQA

02 Performance drops as object counts increase

03 Benchmark reveals significant counting weaknesses in MLLMs
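
The first two findings can be illustrated with a small scoring sketch. It assumes exact-match scoring (a prediction counts as correct only if it equals the ground-truth count), which this page does not confirm as the paper's protocol. The function names, the bin edges (10 and 50), and the toy numbers are hypothetical and are not taken from CountQA.

```python
from collections import defaultdict


def exact_match_accuracy(predictions, targets):
    """Fraction of predicted counts that equal the ground-truth count exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


def bin_label(count, edges):
    """Map a ground-truth count to a range label such as '0-10' or '>50'."""
    lower = 0
    for edge in edges:
        if count <= edge:
            return f"{lower}-{edge}"
        lower = edge + 1
    return f">{edges[-1]}"


def accuracy_by_count_bin(predictions, targets, edges=(10, 50)):
    """Exact-match accuracy grouped by ground-truth count range.

    The default edges are illustrative assumptions, not the paper's bins.
    """
    grouped = defaultdict(lambda: ([], []))
    for p, t in zip(predictions, targets):
        preds, golds = grouped[bin_label(t, edges)]
        preds.append(p)
        golds.append(t)
    return {label: exact_match_accuracy(ps, ts) for label, (ps, ts) in grouped.items()}


# Toy data, invented for illustration only.
toy_predictions = [3, 12, 40, 97]
toy_targets = [3, 12, 45, 120]

overall = exact_match_accuracy(toy_predictions, toy_targets)
per_bin = accuracy_by_count_bin(toy_predictions, toy_targets)
print(overall)  # 0.5
print(per_bin)  # {'0-10': 1.0, '11-50': 0.5, '>50': 0.0}
```

Grouping by the ground-truth count rather than the predicted count keeps each bin's membership independent of the model being evaluated, so per-bin accuracies remain comparable across models.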

Abstract

Multimodal Large Language Models (MLLMs) demonstrate remarkable fluency in understanding visual scenes, yet they exhibit a critical deficiency in a fundamental cognitive skill: object counting. This blind spot severely limits their reliability in real-world applications. To date, this capability has been largely unevaluated in complex scenarios, as existing benchmarks either feature sparse object densities or are confined to specific visual domains, failing to test models under realistic conditions. Addressing this gap, we introduce CountQA, a challenging new benchmark designed to probe this deficiency. Comprising over 1,500 question-answer pairs, CountQA features real-world images with high object density, clutter, and occlusion. We investigate this weakness by evaluating 15 prominent MLLMs on the CountQA benchmark and reveal that the top-performing model achieves a mere 42.9% accuracy, with…

Code & Models

Datasets

Jayant-Sravan/CountQA
dataset · 635 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications