EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos
Fumihiko Tsuchiya, Taiki Miyanishi, Mahiro Ukai, Nakamasa Inoue, Shuhei Kurita, Yusuke Iwasawa, and Yutaka Matsuo

TL;DR
EC-Bench is a new benchmark for evaluating enumeration, counting, and temporal grounding in ultra-long videos, exposing limitations of current models in long-range video reasoning.
Contribution
Introduces EC-Bench, a long-video benchmark of 152 videos (each longer than 30 minutes) and 1,699 queries paired with explicit evidence spans, and evaluates 22 multimodal large language models (MLLMs) on it.
Findings
The best of the 22 evaluated MLLMs reaches only 29.98% accuracy on Enumeration and 23.74% on Counting.
Human performance exceeds 78% on Enumeration and 83% on Counting.
Enumeration accuracy, temporal grounding, and counting performance are strongly correlated.
Abstract
Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches…
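The summary does not state how EC-Bench scores its tasks, so the following is only a minimal sketch of two measures commonly used for this kind of evaluation: exact-match accuracy for counts and temporal intersection-over-union (IoU) for evidence spans. The names `Span`, `temporal_iou`, and `counting_accuracy` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical representation of an evidence span: (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    # Overlap length is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def counting_accuracy(pred_counts: List[int], gold_counts: List[int]) -> float:
    """Fraction of queries whose predicted count equals the reference count."""
    if len(pred_counts) != len(gold_counts):
        raise ValueError("prediction and reference lists must have equal length")
    if not gold_counts:
        return 0.0
    correct = sum(p == g for p, g in zip(pred_counts, gold_counts))
    return correct / len(gold_counts)
```

Under these assumptions, a model that gives the right count but points at the wrong part of the video would score well on `counting_accuracy` and poorly on `temporal_iou`. Evaluating only the final number cannot expose that gap, which is the shortcoming of existing benchmarks that the abstract describes.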
