VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
Shaoyang Cui, Lingbei Meng

TL;DR
VidNum-1.4K is a new comprehensive benchmark for evaluating video-based numerical reasoning in vision-language models, emphasizing complex multi-step logic in diverse real-world videos.
Contribution
Introduces VidNum-1.4K, a diverse, hierarchical VideoQA benchmark designed to assess genuine numerical reasoning beyond superficial tasks.
Findings
State-of-the-art VLMs perform poorly on VidNum-1.4K, with accuracy mostly below 60%.
Current models lack a stable internal world model for complex numerical reasoning.
VidNum-1.4K serves as a challenging diagnostic for future video numerical reasoning models.
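As a rough illustration of how an accuracy figure like the one above might be computed, the sketch below scores predictions by exact match, both overall and per quantification type (object, action, event). The record fields `category`, `answer`, and `prediction`, the exact-match rule, and the sample data are assumptions made for illustration; they are not the benchmark's actual schema, scoring protocol, or results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return exact-match accuracy overall and per category.

    Each record is a dict with hypothetical keys
    'category', 'answer', and 'prediction'.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        # Compare as stripped strings so "3" and " 3" count as equal.
        hit = str(r["prediction"]).strip() == str(r["answer"]).strip()
        for key in ("overall", r["category"]):
            totals[key] += 1
            if hit:
                correct[key] += 1
    return {k: correct[k] / totals[k] for k in totals}


# Made-up records, only to exercise the function.
records = [
    {"category": "object", "answer": "3", "prediction": "3"},
    {"category": "action", "answer": "5", "prediction": "4"},
    {"category": "event", "answer": "2", "prediction": "2"},
    {"category": "object", "answer": "7", "prediction": "6"},
]
scores = accuracy_by_category(records)
```

Reporting per-category scores alongside the overall number makes it easier to see whether a model fails uniformly or only on one kind of quantification.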
Abstract
Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly "understand" real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. The VidNum-1.4K is uniquely structured into a…
