VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Jiahao Meng; Tan Yue; Qi Xu; Haochen Wang; Zhongwei Ren; Weisong Liu; Yuhao Wang; Renrui Zhang; Yunhai Tong; Haodong Duan

arXiv:2604.01569·cs.CV·April 3, 2026

TL;DR

VideoZeroBench is a new hierarchical benchmark for long-video question answering that rigorously verifies spatio-temporal evidence, exposing gaps in current models' grounded video understanding capabilities.

Contribution

The paper introduces a five-level evaluation protocol and a dataset of 500 manually annotated long-video questions across 13 domains, each paired with temporal intervals and spatial bounding boxes as evidence.

Findings

01. Gemini-3-Pro answers fewer than 17% of questions correctly under standard conditions.

02. Performance drops below 1% accuracy when strict evidence grounding is required.

03. Most models fail to produce correct grounded predictions, highlighting a gap in grounded video reasoning.

Abstract

Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address these limitations, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence…
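The abstract is cut off before it defines the five levels, so the following is only a minimal sketch of what evidence-verified scoring could look like, not the benchmark's actual protocol. It assumes evidence is compared by intersection-over-union (IoU) on the temporal interval and on the bounding box, and that an answer counts as grounded only if both clear a threshold. The function names, the 0.5 thresholds, and the tuple layouts are all hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]            # (start_sec, end_sec), assumed start <= end
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed x1 <= x2, y1 <= y2


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """IoU of two time intervals; 0.0 when they do not overlap."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gold: Box) -> float:
    """IoU of two axis-aligned boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(pred[2], gold[2]) - max(pred[0], gold[0]))
    inter_h = max(0.0, min(pred[3], gold[3]) - max(pred[1], gold[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gold = (gold[2] - gold[0]) * (gold[3] - gold[1])
    union = area_pred + area_gold - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(
    answer_ok: bool,
    pred_interval: Interval,
    gold_interval: Interval,
    pred_box: Box,
    gold_box: Box,
    t_thresh: float = 0.5,   # hypothetical threshold, not taken from the paper
    s_thresh: float = 0.5,   # hypothetical threshold, not taken from the paper
) -> bool:
    """Strictest check in this sketch: right answer AND right time AND right place."""
    return (
        answer_ok
        and temporal_iou(pred_interval, gold_interval) >= t_thresh
        and box_iou(pred_box, gold_box) >= s_thresh
    )
```

Under this kind of scoring, a model that answers correctly but points at the wrong moment or region gets no credit, which is consistent with the reported gap between answer-only accuracy and strictly grounded accuracy.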

Code & Models

Repositories

marinero4972/VideoZeroBench (GitHub)
