Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

Bo Feng; Zhengfeng Lai; Shiyu Li; Zizhen Wang; Simon Wang; Ping Huang; Meng Cao

arXiv:2505.14321 · cs.CV · May 21, 2025

TL;DR

This paper critically examines video understanding benchmarks, revealing limitations like language priors and shuffling invariance, and introduces VBenchComp, a pipeline for more precise evaluation of models' temporal reasoning abilities.

Contribution

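The paper identifies two issues in current video benchmarks, strong language priors and shuffling invariance, and proposes VBenchComp, an automated pipeline that categorizes questions as LLM-Answerable, Semantic, or Temporal so that temporal understanding can be assessed separately.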

Findings

01. Models often rely on language priors, not video content.

02. Shuffling frames does not always degrade model performance.

03. VBenchComp enables fine-grained evaluation of temporal reasoning.

Abstract

Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions…
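
The two limitations named in the abstract suggest a simple per-question check, sketched below in Python. This is an illustration under stated assumptions, not the authors' pipeline: `answer_fn` is a hypothetical stand-in for a model call, the "Unresolved" label is invented here for questions the model misses even with ordered frames, and the mapping of shuffle-invariant questions to "Semantic" is an assumption, because the abstract text above is cut off before it defines that category.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical model interface (an assumption, not from the paper):
# takes a question and either a frame sequence or None (no video),
# and returns the predicted answer string.
AnswerFn = Callable[[str, Optional[Sequence[object]]], str]


def categorize_question(
    question: str,
    answer: str,
    frames: Sequence[object],
    answer_fn: AnswerFn,
    seed: int = 0,
) -> str:
    """Assign one question to a category using three probes of the model."""
    # Probe 1: no video at all. A correct answer here points to a language prior.
    if answer_fn(question, None) == answer:
        return "LLM-Answerable"

    # Probe 2: frames in their original order. If the model fails here,
    # the shuffling probe is uninformative ("Unresolved" is a label
    # introduced for this sketch only).
    if answer_fn(question, frames) != answer:
        return "Unresolved"

    # Probe 3: same frames, temporally shuffled with a fixed seed.
    shuffled = list(frames)
    random.Random(seed).shuffle(shuffled)
    if answer_fn(question, shuffled) == answer:
        # Assumed mapping: shuffle-invariant questions are treated as Semantic.
        return "Semantic"

    # Correct only when temporal order is preserved.
    return "Temporal"
```

In practice a single probe per condition is noisy, so a pipeline like this would more plausibly aggregate over several models or several shuffles before assigning a label; the sketch keeps one probe each to stay readable.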
