ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
Haotian Xue, Yipu Chen, Liqian Ma, Zelin Zhao, Lama Moukheiber, Yuchen Zhu, Yongxin Chen

TL;DR
ACWM-Phys introduces a comprehensive benchmark for evaluating action-conditioned video world models across diverse physical dynamics, highlighting challenges in out-of-distribution generalization and guiding future model improvements.
Contribution
The paper presents ACWM-Phys, a new benchmark environment with diverse physical interactions and evaluation protocols for assessing generalized physical prediction in world models.
Findings
Models perform well on simple, geometric interactions.
Generalization drops significantly with deformable and complex motions.
Cross-attention and causal VAEs improve the modeling of high-dimensional actions.
Abstract
Action-conditioned world models (ACWMs) have shown strong promise for video prediction and decision-making. However, existing benchmarks are largely restricted to egocentric navigation or narrow, task-specific robotics datasets, offering only limited coverage of the rich physical interactions required for generalized world understanding. We introduce ACWM-Phys, a new benchmark for evaluating action-conditioned prediction under diverse physical dynamics in a clean, controllable simulation environment with a carefully designed action space. ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics. To evaluate both interpolation and generalization, we design in-distribution and out-of-distribution protocols with controlled shifts in interaction patterns or scene configurations. By building the benchmark…
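The in-distribution and out-of-distribution protocols mentioned in the abstract can be illustrated with a small sketch. This is not the benchmark's actual data pipeline: the `Episode` record, the `num_objects` scene parameter, and the `split_id_ood` helper are all hypothetical names chosen for illustration, and the "controlled shift" is assumed here to be a simple threshold on one scene-configuration value.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    # All fields are illustrative assumptions, not the benchmark's schema.
    episode_id: int
    category: str      # e.g. "rigid_body" or "deformable"
    num_objects: int   # hypothetical scene-configuration parameter


def split_id_ood(episodes, train_max_objects, id_test_fraction=0.2, seed=0):
    """Split episodes into train, in-distribution test, and OOD test sets.

    Episodes whose configuration lies inside the training range are shuffled
    and divided into train / ID test. Episodes outside the range form the
    OOD test set and never appear in training.
    """
    in_range = [e for e in episodes if e.num_objects <= train_max_objects]
    ood_test = [e for e in episodes if e.num_objects > train_max_objects]

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(in_range)
    n_id = int(len(in_range) * id_test_fraction)
    return in_range[n_id:], in_range[:n_id], ood_test


# Toy data: 100 episodes with 1 to 5 objects each.
episodes = [Episode(i, "rigid_body", num_objects=1 + i % 5) for i in range(100)]
train, id_test, ood_test = split_id_ood(episodes, train_max_objects=3)
```

Under these assumptions, the ID test set measures interpolation within the configurations seen in training, while the OOD test set contains only configurations that were held out entirely.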
