PhyGround: Benchmarking Physical Reasoning in Generative World Models
Juyi Lin, Arash Akbari, Yumei He, Lin Zhao, Haichao Zhang, Arman Akbari, Xingchen Xu, Zoe Y. Lu, Enfu Nan, Hokin Deng, Edmund Yeh, Sarah Ostadabbas, Yun Fu, Jennifer Dy, Pu Zhao, Yanzhi Wang

TL;DR
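PhyGround is a criteria-grounded benchmark for evaluating physical reasoning in video generation models, combining curated prompts, law-specific diagnostics, and human annotations. It targets three limitations of earlier benchmarks: coarse evaluation frameworks, annotation affected by response bias and fatigue, and automated evaluators that are insufficiently physics-aware.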
Contribution
The paper introduces a physics-aware benchmark with detailed diagnostics, large-scale human annotations, and an open-source physics-specialized model for automated evaluation.
Findings
Eight video generation models were evaluated, with high inter-annotator agreement.
PhyJudge-9B outperforms Gemini-3.1-Pro with lower bias.
A large-scale human study collected nearly 6,000 annotations.
Abstract
Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges: coarse evaluation frameworks that hide law-specific failures, response biases and annotator fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address these challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across…
