VideoPhy: Evaluating Physical Commonsense for Video Generation
Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover

TL;DR
VideoPhy introduces a benchmark to evaluate whether current text-to-video models generate videos that adhere to physical commonsense, revealing significant gaps in their ability to simulate real-world physics.
Contribution
The paper presents VideoPhy, a new benchmark and auto-evaluator for assessing physical commonsense in video generation models, highlighting their current limitations.
Findings
Existing models often fail to generate physically plausible videos.
The best model, CogVideoX-5B, generates videos that follow both the caption and physical laws in only 39.6% of cases.
VideoPhy exposes the gap between current capabilities and real-world physical understanding.
Abstract
Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts, synthesize realistic motions and render complex objects. Hence, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models. To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g. marbles will roll down when placed on a slanted surface). Specifically, we curate diverse prompts that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions…
Peer Reviews
Decision: ICLR 2025 Poster
The goal of evaluating physical commonsense is quite important for this area. The chosen video generators are comprehensive, including open-source models and closed ones. The presentation is well-organized and easy to follow, with key numbers regarding evaluations.
Regarding evaluation, I have several major concerns. - As mentioned in Sec. 3.1, binary feedback (0/1) is used to evaluate semantic adherence and physical commonsense. This discrete value may not reflect and monitor the true capability of different video generators. For example, for a text prompt with 10 physical movements, one generator achieves 8 movements while another achieves 6. Binary feedback cannot capture the gap between these two candidates. This example could be too extreme while that could
* The paper shifts attention from general visual and semantic quality to the capability of T2V models to simulate real-world physics, addressing a vital aspect of realism in video generation.
* The paper provides detailed insights into different failure modes, guiding future model improvements and research directions.
* The automation pipeline, VideoCon-Physics, enables scalable assessment of semantic adherence and physical commonsense in generated videos, which can be useful and meaningful to th
* Among the T2V models used for comparison, some still frequently fail to reproduce the scenarios specified by the text prompts. For example, in assessing physical reasoning in a scenario where milk is being poured, one needs to verify whether the milk appropriately fills the cup. However, in practice, these models often fail even to generate a video depicting the act of pouring milk. In such cases, the benchmark may be more influenced by the general video generation capabilities of the models r
1. This paper presents a physical commonsense benchmark that addresses a gap in existing datasets. The field of video generation needs such a benchmark, as T2V models gain popularity partly for their potential as world simulators or physics engines.
2. The dataset covers a wide range of activities, interactions, and dynamics on various materials, as mentioned in lines 194-195. The classification of the dataset is somewhat inspired by the graphics field, and the difficulty of simulating those dynamics is al
1. I am not an expert in graphics or materials, so I am very uncertain about the category definitions and category ratios. I can get the idea of categorizing based on the state of matter, and solid and liquid are the most common. However, in graphics, rigid bodies, soft bodies, particle systems, fabrics, characters and animals are distinct topics that rely on very different physical models, whereas fluid dynamics, such as inviscid and viscous flows, are comparatively less diverse. If the idea of
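The binary scoring discussed in the reviews can be illustrated with a short sketch. It assumes each generated video receives a 0/1 semantic adherence (SA) label and a 0/1 physical commonsense (PC) label, and that a video counts toward the headline rate only when both are 1. The `Rating` schema, the function names, and the all-or-nothing thresholding in `binary_label` are hypothetical and chosen for illustration only; they are not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    """One judgment of a generated video (hypothetical schema)."""
    sa: int  # semantic adherence: 1 if the video follows the caption, else 0
    pc: int  # physical commonsense: 1 if the video looks physically plausible, else 0


def rates(ratings):
    """Return the fraction of videos with SA=1, with PC=1, and with both."""
    if not ratings:
        raise ValueError("need at least one rating")
    n = len(ratings)
    return {
        "sa": sum(r.sa for r in ratings) / n,
        "pc": sum(r.pc for r in ratings) / n,
        "joint": sum(1 for r in ratings if r.sa == 1 and r.pc == 1) / n,
    }


def binary_label(correct, total):
    """Assumed thresholding: 1 only if every movement in the prompt is rendered plausibly."""
    return int(correct == total)


def graded_score(correct, total):
    """Alternative graded score: fraction of movements rendered plausibly."""
    return correct / total


if __name__ == "__main__":
    demo = [Rating(1, 1), Rating(1, 0), Rating(0, 1), Rating(0, 0)]
    print(rates(demo))

    # The reviewer's example: 10 movements, one model gets 8 right, another 6.
    for correct in (8, 6):
        print(correct, binary_label(correct, 10), graded_score(correct, 10))
```

Under this assumed thresholding, both generators in the reviewer's example receive the same binary label, while a graded score keeps them apart. Whether human annotators actually apply such an all-or-nothing rule is not stated in the excerpt above.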
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization
