VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Jiashuo Yu; Yue Wu; Meng Chu; Zhifei Ren; Zizheng Huang; Pei Chu; Ruijie Zhang; Yinan He; Qirui Li; Songze Li; Zhenxiang Li; Zhongying Tu; Conghui He; Yu Qiao; Yali Wang; Yi Wang; Limin Wang

arXiv:2506.10857 · cs.CV · August 5, 2025

TL;DR

VRBench is a benchmark for evaluating large models' multi-step reasoning over long narrative videos. It targets temporal reasoning and procedural validity, pairing 960 long videos with human-labeled multi-step question-answering pairs and a multi-phase evaluation pipeline that assesses models at both the outcome and process levels.

Contribution

VRBench introduces the first long narrative video benchmark with multi-step reasoning annotations, together with an evaluation framework that assesses models' reasoning chains as well as their final answers.

Findings

01. 12 LLMs evaluated with detailed analysis

02. 19 VLMs assessed for multi-step reasoning capabilities

03. Proposed scoring metric evaluates reasoning quality comprehensively

Abstract

We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 960 long videos (with an average duration of 1.6 hours), along with 8,243 human-labeled multi-step question-answering pairs and 25,106 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the…
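
The dataset figures quoted in the abstract imply a few ratios that help size the benchmark. The snippet below is a minimal sketch of that arithmetic; the constant and function names are illustrative and not taken from the paper or its released code. The total-duration figure is only an estimate, since it multiplies the video count by a rounded average.

```python
# Figures quoted in the abstract.
NUM_VIDEOS = 960
NUM_QA_PAIRS = 8_243
NUM_REASONING_STEPS = 25_106
AVG_DURATION_HOURS = 1.6  # rounded average, so totals derived from it are approximate


def dataset_ratios():
    """Return simple ratios derived from the abstract's headline counts.

    Names here are hypothetical; only the four input numbers come from the source.
    """
    return {
        "qa_pairs_per_video": NUM_QA_PAIRS / NUM_VIDEOS,
        "steps_per_qa_pair": NUM_REASONING_STEPS / NUM_QA_PAIRS,
        "approx_total_hours": NUM_VIDEOS * AVG_DURATION_HOURS,
    }


if __name__ == "__main__":
    for name, value in dataset_ratios().items():
        print(f"{name}: {value:.2f}")
```

By these counts, each video carries roughly 8.6 question-answering pairs and each pair averages about 3 timestamped reasoning steps, which is consistent with the abstract's description of chains that require multiple temporally grounded steps.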

Code & Models

Datasets

OpenGVLab/VRBench
dataset · 491 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Artificial Intelligence in Games