RISE-Video: Can Video Generators Decode Implicit World Rules?

Mingxin Liu; Shuran Ma; Shibei Meng; Xiangyu Zhao; Zicheng Zhang; Shaofeng Zhang; Zhihang Zhong; Peixian Chen; Haoyu Cao; Xing Sun; Haodong Duan; Xue Yang

arXiv:2602.05986 · cs.CV · February 6, 2026

TL;DR

RISE-Video introduces a new benchmark for evaluating video generators' ability to understand and reason over implicit world rules, highlighting current models' limitations in complex, constraint-driven scenarios.

Contribution

This paper presents RISE-Video, a novel reasoning-oriented benchmark for Text-Image-to-Video (TI2V) models, together with a multi-dimensional evaluation protocol and an automated assessment pipeline.

Findings

01

Models show significant deficiencies in reasoning over implicit constraints.

02

The benchmark reveals gaps in temporal consistency and physical rationality.

03

Automated evaluation correlates well with human judgment.
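
Finding 03 rests on agreement between automated and human scores. This page does not say which statistic the authors report, so the following is only a hedged illustration of how such agreement is commonly measured: Pearson and Spearman correlations over paired per-sample scores, in plain Python. The function names pearson, average_ranks, and spearman and the example scores are hypothetical, not taken from the paper.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / (vx * vy) ** 0.5

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical paired scores for five generated videos.
auto_scores = [4.5, 3.0, 2.0, 5.0, 1.0]
human_scores = [4, 3, 3, 5, 1]
print(spearman(auto_scores, human_scores))  # ≈ 0.975 (sqrt(0.95))
```

Spearman is often preferred for judge-versus-human comparisons because it only asks whether both raters order the samples the same way, not whether their scales match. On Python 3.12 and later, statistics.correlation(x, y, method="ranked") computes the same quantity.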

Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further…
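
The abstract names four metrics and eight categories, but this page does not describe how each metric is scored or how results are combined. The following is a minimal sketch under stated assumptions: each metric is a float on a hypothetical 0-to-1 scale, and results are summarized by an unweighted mean per category. Neither assumption is confirmed by the source; the class SampleScores, the function per_category_means, and the category labels "commonsense" and "spatial" are illustrative placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The four metric names come from the abstract; the field names are hypothetical.
METRICS = ("reasoning_alignment", "temporal_consistency",
           "physical_rationality", "visual_quality")

@dataclass
class SampleScores:
    sample_id: str
    category: str                # placeholder label for one of the eight categories
    reasoning_alignment: float   # assumed 0-to-1 scale (hypothetical)
    temporal_consistency: float
    physical_rationality: float
    visual_quality: float

def per_category_means(samples):
    """Average each metric within each category (unweighted; hypothetical aggregation)."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[s.category].append(s)
    return {
        cat: {m: mean(getattr(s, m) for s in group) for m in METRICS}
        for cat, group in buckets.items()
    }

samples = [
    SampleScores("s1", "commonsense", 1.0, 0.5, 0.75, 1.0),
    SampleScores("s2", "commonsense", 0.5, 1.0, 0.25, 0.5),
    SampleScores("s3", "spatial", 0.25, 1.0, 0.5, 0.75),
]
report = per_category_means(samples)
print(report["commonsense"]["reasoning_alignment"])  # 0.75
```

Reporting the metrics separately, rather than collapsing them into a single number, keeps a model that renders sharp but implausible videos distinguishable from one that reasons correctly at lower visual quality.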

Code & Models

Datasets

VisionXLab/RISE-Video
Dataset · 1.1k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)