TL;DR
VideoZeroBench is a new hierarchical benchmark for long-video question answering that rigorously verifies spatio-temporal evidence, exposing gaps in current models' grounded video understanding capabilities.
Contribution
It introduces a challenging, multi-level evaluation protocol and a comprehensive dataset with annotated spatio-temporal evidence for long-video QA.
Findings
Gemini-3-Pro answers fewer than 17% of questions correctly under standard conditions.
Performance drops below 1% accuracy when strict evidence grounding is required.
Most models fail to produce correct grounded predictions, highlighting a gap in grounded video reasoning.
Abstract
Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address these limitations, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence…
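The abstract pairs each question with a temporal interval and a spatial bounding box as evidence. One common way to check such evidence is intersection-over-union (IoU) against the annotation. The sketch below is illustrative only: the function names, the 0.5 thresholds, and the rule "answer correct and both overlaps above threshold" are assumptions made for this example, not the benchmark's published protocol, whose five levels are not spelled out in the text above.

```python
from typing import Tuple

Interval = Tuple[float, float]             # (start_sec, end_sec)
Box = Tuple[float, float, float, float]    # (x1, y1, x2, y2)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """IoU of two time intervals; 0.0 if they do not overlap."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """IoU of two axis-aligned boxes; 0.0 if they do not overlap."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(
    answer_ok: bool,
    pred_interval: Interval,
    gt_interval: Interval,
    pred_box: Box,
    gt_box: Box,
    t_thresh: float = 0.5,   # hypothetical threshold, not from the paper
    s_thresh: float = 0.5,   # hypothetical threshold, not from the paper
) -> bool:
    """Count a prediction only if the answer and both evidence checks pass."""
    return (
        answer_ok
        and temporal_iou(pred_interval, gt_interval) >= t_thresh
        and box_iou(pred_box, gt_box) >= s_thresh
    )
```

Under a rule of this kind, a model that picks the right answer but points at the wrong moment or region receives no credit, which is consistent with the large gap between the standard and strictly grounded accuracies reported in the findings.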
