Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning

Qin Zhang; Peiyu Jing; Hong-Xing Yu; Fangqiang Ding; Fan Nie; Weimin Wang; Yilun Du; James Zou; Jiajun Wu; Bing Shuai

arXiv:2603.19607 · cs.CV · March 23, 2026

TL;DR

Physion-Eval is a large-scale benchmark of expert human reasoning for evaluating physical realism in generated videos. It shows that most videos from current models contain physical glitches, and it aims to support progress in physics-grounded video generation.

Contribution

Introduces Physion-Eval, a large-scale dataset of expert annotations (10,990 reasoning traces across 22 fine-grained physical categories) for diagnosing physical realism failures in generated videos, moving evaluation beyond automated metrics and coarse human judgments.

Findings

01. 83.3% of exocentric videos show physical glitches

02. 93.5% of egocentric videos exhibit physical glitches

03. Physion-Eval sets a new standard for physical realism assessment

Abstract

Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting…

Code & Models

Datasets

PhysionLabs/Physion-Eval
dataset · 122 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications