Beyond the Last Frame: Process-aware Evaluation for Generative Video Reasoning

Yifan Li; Yukai Gu; Yingqian Min; Zikang Liu; Yifan Du; Kun Zhou; Min Yang; Wayne Xin Zhao; Minghui Qiu

arXiv:2512.24952 · cs.CV · January 15, 2026



TL;DR

This paper introduces VIPER, a comprehensive benchmark and a new evaluation metric for generative video reasoning, emphasizing process validation over outcome accuracy to prevent outcome-hacking.

Contribution

The paper presents VIPER, a multi-task benchmark for process-aware evaluation, and POC@r, a metric that uses a hierarchical rubric to assess both intermediate reasoning steps and final results.

Findings

1. State-of-the-art models achieve only about 20% POC@1.0.

2. Current models exhibit significant outcome-hacking.

3. There is a large gap between current video generation and true reasoning.

Abstract

Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that…
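The difference between outcome-only and process-aware scoring can be illustrated with a small sketch. This is not the paper's POC@r definition, which this page does not give. It assumes a hypothetical judge that returns two boolean verdicts per generated video (`process_ok` and `outcome_ok`), and the sample verdicts are made-up toy data, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical judge output for one generated video (an assumption)."""
    process_ok: bool  # intermediate frames follow a valid reasoning process
    outcome_ok: bool  # final frame shows the correct result


def outcome_accuracy(verdicts):
    """Fraction of videos whose final result is correct, ignoring the process."""
    return sum(v.outcome_ok for v in verdicts) / len(verdicts)


def consistent_rate(verdicts):
    """Fraction of videos where both the process and the outcome are valid."""
    return sum(v.process_ok and v.outcome_ok for v in verdicts) / len(verdicts)


def outcome_hacking_rate(verdicts):
    """Fraction of videos with a correct outcome reached via an invalid process."""
    return sum(v.outcome_ok and not v.process_ok for v in verdicts) / len(verdicts)


# Toy verdicts, invented purely for illustration.
toy_verdicts = [
    Verdict(process_ok=True, outcome_ok=True),
    Verdict(process_ok=False, outcome_ok=True),   # outcome-hacking case
    Verdict(process_ok=False, outcome_ok=False),
    Verdict(process_ok=True, outcome_ok=False),
]

print(outcome_accuracy(toy_verdicts))      # outcome-only view
print(consistent_rate(toy_verdicts))       # process-aware view
print(outcome_hacking_rate(toy_verdicts))  # gap between the two views
```

In this sketch the outcome-only score always equals the consistent rate plus the outcome-hacking rate, which is why a single-frame assessment can overstate reasoning ability.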


Code & Models

Datasets

Monosail/VIPER
dataset · 63 downloads


Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition