ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

Haotian Xue; Yipu Chen; Liqian Ma; Zelin Zhao; Lama Moukheiber; Yuchen Zhu; Yongxin Chen

arXiv:2605.08567 · cs.CV · May 19, 2026

TL;DR

ACWM-Phys introduces a comprehensive benchmark for evaluating action-conditioned video world models across diverse physical dynamics, highlighting challenges in out-of-distribution generalization and guiding future model improvements.

Contribution

The paper presents ACWM-Phys, a new benchmark environment with diverse physical interactions and evaluation protocols for assessing generalized physical prediction in world models.

Findings

01. Models perform well on simple, geometric interactions.

02. Generalization drops significantly on deformable and complex motions.

03. Cross-attention and causal VAEs improve the modeling of high-dimensional actions.

Abstract

Action-conditioned world models (ACWMs) have shown strong promise for video prediction and decision-making. However, existing benchmarks are largely restricted to egocentric navigation or narrow, task-specific robotics datasets, offering only limited coverage of the rich physical interactions required for generalized world understanding. We introduce ACWM-Phys, a new benchmark for evaluating action-conditioned prediction under diverse physical dynamics in a clean, controllable simulation environment with a carefully designed action space. ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics. To evaluate both interpolation and generalization, we design in-distribution and out-of-distribution protocols with controlled shifts in interaction patterns or scene configurations. By building the benchmark…
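The in-distribution (ID) and out-of-distribution (OOD) protocol mentioned in the abstract can be illustrated with a minimal sketch of a controlled shift in scene configuration. Everything here is an assumption made for illustration: the `SceneConfig` fields, the choice of object count as the shifted factor, and the helper names `split_id_ood` and `train_eval_split` are hypothetical and are not taken from ACWM-Phys.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneConfig:
    # Hypothetical scene parameters, chosen only for illustration.
    num_objects: int
    mass: float


def split_id_ood(configs, max_id_objects):
    """Treat scenes with more objects than max_id_objects as OOD."""
    in_dist = [c for c in configs if c.num_objects <= max_id_objects]
    ood = [c for c in configs if c.num_objects > max_id_objects]
    return in_dist, ood


def train_eval_split(in_dist, eval_fraction, seed=0):
    """Shuffle the ID configs and hold out a fraction for ID evaluation."""
    rng = random.Random(seed)
    shuffled = list(in_dist)
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Enumerate a small grid of assumed scene configurations.
configs = [SceneConfig(n, m) for n in range(1, 7) for m in (0.5, 1.0, 2.0)]

# Controlled shift: object counts above 4 never appear during training.
in_dist, ood = split_id_ood(configs, max_id_objects=4)
train, id_eval = train_eval_split(in_dist, eval_fraction=0.25)
```

Under this kind of split, `id_eval` would measure interpolation within the training range, while `ood` would measure generalization to unseen scene configurations. A shift in interaction patterns could be sketched the same way by holding out a different factor.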
