UNICBench: UNIfied Counting Benchmark for MLLM

Chenggang Rong; Tao Han; Zhiyuan Zhao; Yaowu Fan; Jia Wan; Song Guo; Yuan Yuan; Junyu Gao

arXiv:2603.00595 · cs.CV · March 3, 2026

Open Access

TL;DR

UNICBench is a comprehensive benchmark for evaluating counting abilities of multimodal large language models across image, text, and audio, revealing strengths and gaps in current models and providing a standardized evaluation toolkit.

Contribution

The paper introduces UNICBench, a unified multimodal counting benchmark with a standardized evaluation protocol and an annotated corpus spanning images, documents, and audio clips, for rigorous assessment of MLLMs.

Findings

01 Strong performance on some basic counting tasks

02 Significant gaps on reasoning and the hardest partitions

03 Long-tail errors and substantial headroom for improvement

Abstract

Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi-level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three-level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits/prompts/seeds and modality-specific matching rules, we evaluate 45 state-of-the-art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long-tail errors and substantial headroom for improving general…
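
The abstract mentions deterministic numeric parsing and stratified reporting but does not spell out the rules. As a rough illustration only, the sketch below shows one way such a pipeline could look. The names `parse_count` and `stratified_accuracy`, the "take the last integer in the response" rule, and exact-match scoring are all assumptions made for this example and are not taken from the UNICBench toolkit.

```python
import re
from collections import defaultdict

# Hypothetical rule (an assumption, not the paper's): a count is a run of
# digits, optionally with thousands separators such as "1,234".
_NUM = re.compile(r"\d[\d,]*")


def parse_count(response):
    """Return the last integer found in a model response, or None if absent."""
    matches = _NUM.findall(response)
    if not matches:
        return None
    # Strip separators (and any trailing comma) before converting.
    return int(matches[-1].replace(",", ""))


def stratified_accuracy(records):
    """Compute exact-match accuracy per stratum.

    records: iterable of (stratum, response, ground_truth) tuples, where
    stratum could be a modality, capability level, or difficulty tag.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stratum, response, truth in records:
        totals[stratum] += 1
        # Exact match is assumed here; the paper uses modality-specific rules.
        if parse_count(response) == truth:
            hits[stratum] += 1
    return {s: hits[s] / totals[s] for s in totals}


if __name__ == "__main__":
    demo = [
        ("image", "I see 3 cats", 3),
        ("image", "Maybe 5", 4),
        ("audio", "2 beeps, then 7 total", 7),
    ]
    print(stratified_accuracy(demo))
```

Because the parser is a pure function of the response string, repeated runs over the same model outputs give identical scores, which is the property a deterministic parsing step is meant to guarantee. Reporting per stratum rather than as one pooled number is what lets weak partitions stay visible.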

Taxonomy

Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Text Readability and Simplification