VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos

Pengyiang Liu; Zhongyue Shi; Hongye Hao; Qi Fu; Xueting Bi; Siwei Zhang; Xiaoyang Hu; Zitian Wang; Linjiang Huang; Si Liu

arXiv:2603.12703 · cs.CV · March 26, 2026

TL;DR

VCBench is a new streaming counting benchmark designed to evaluate and diagnose the ability of video understanding models to maintain spatial-temporal world state, revealing significant deficiencies in current models.

Contribution

It introduces a comprehensive streaming counting benchmark with detailed annotations and metrics for diagnosing spatial-temporal state maintenance in video models.

Findings

01. Current models struggle with spatial-temporal state maintenance.
02. Models show deficiencies in periodic event counting.
03. VCBench provides a diagnostic framework for improvement.

Abstract

Video understanding requires models to continuously track and update world state during playback. While existing benchmarks have advanced video understanding evaluation across multiple dimensions, the observation of how models maintain world state remains insufficient. We propose VCBench, a streaming counting benchmark that repositions counting as a minimal probe for diagnosing world state maintenance capability. We decompose this capability into object counting and event counting, forming 8 fine-grained subcategories. Object counting covers tracking currently visible objects and cumulative unique identities, while event counting covers detecting instantaneous actions and tracking complete activity cycles. VCBench contains 406 videos with frame-by-frame annotations of 10,071 event occurrence moments and object state change moments, generating 1,000 streaming QA pairs with 4,576 query…
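The distinction the abstract draws between counting currently visible objects, counting cumulative unique identities, and counting instantaneous events can be illustrated with a toy state tracker. This is a minimal sketch under stated assumptions, not VCBench's annotation format or evaluation code: the class `StreamingCounter`, its fields, and the `frames` stream are hypothetical names invented for illustration, and object identities are assumed to be given rather than inferred from pixels.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingCounter:
    """Toy world-state tracker (hypothetical; not VCBench's evaluation code)."""
    visible: set = field(default_factory=set)  # identities in the current frame
    seen: set = field(default_factory=set)     # every identity observed so far
    events: int = 0                            # instantaneous events so far

    def update(self, visible_ids, event_happened=False):
        # Replace the visible set, grow the cumulative set, tally events.
        self.visible = set(visible_ids)
        self.seen |= self.visible
        self.events += int(event_happened)

    def answer(self):
        # Answer a query using only the frames observed up to now.
        return {
            "currently_visible": len(self.visible),
            "cumulative_unique": len(self.seen),
            "events": self.events,
        }


# Hypothetical stream: (identities visible in the frame, did an event occur?)
frames = [({"a", "b"}, False), ({"b"}, True), ({"b", "c"}, False), ({"a"}, True)]

counter = StreamingCounter()
answers = []
for visible_ids, event_happened in frames:
    counter.update(visible_ids, event_happened)
    answers.append(counter.answer())  # one streaming query per frame
```

In this sketch the three counts diverge as the stream plays: the visible count can fall when objects leave the frame, while the cumulative and event counts only grow. Answering each query correctly therefore requires state carried over from earlier frames, which is the capability the benchmark is described as probing.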

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications