Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
Jun Guo, Qiwei Li, Peiyan Li, Zilong Chen, Nan Sun, Yifei Su, Heyun Wang, Yuan Zhang, Xinghang Li, Huaping Liu

TL;DR
X-WAM is a unified 4D world model that combines real-time robotic action execution with high-fidelity 4D world synthesis (video plus 3D reconstruction). It builds on the priors of pretrained video diffusion models and uses an asynchronous denoising scheme for efficient inference.
Contribution
The paper introduces X-WAM, a framework that integrates 4D world modeling with action execution. It leverages pretrained video diffusion models and proposes Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency.
Findings
Achieves 79.2% success on the RoboCasa benchmark
Achieves 90.7% success on the RoboTwin 2.0 benchmark
Produces superior 4D reconstruction and generation results
Abstract
We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that only model 2D pixel-space and fail to balance action efficiency and world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, and obtains spatial information efficiently through a lightweight structural adaptation: replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth prediction branch for the reconstruction of future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a…
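The structural adaptation described in the abstract, replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth branch, can be illustrated with a toy sketch. This is not the authors' implementation: `Block`, `BranchedBackbone`, `add_depth_branch`, and `num_copied` are hypothetical names, and each block is a scalar affine map standing in for a real transformer block. The abstract does not state how many blocks are copied or how the two branches interact, so the sketch assumes a shared trunk followed by two independent tails.

```python
import copy
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Toy stand-in for one transformer block: y = x * scale + shift."""
    scale: float
    shift: float = 0.0

    def __call__(self, x: List[float]) -> List[float]:
        return [v * self.scale + self.shift for v in x]


@dataclass
class BranchedBackbone:
    trunk: List[Block]       # early blocks shared by both outputs
    rgb_tail: List[Block]    # original final blocks (RGB prediction)
    depth_tail: List[Block]  # replicated final blocks (depth prediction)

    def __call__(self, x: List[float]) -> Tuple[List[float], List[float]]:
        h = x
        for blk in self.trunk:
            h = blk(h)
        rgb, depth = h, h
        for blk in self.rgb_tail:
            rgb = blk(rgb)
        for blk in self.depth_tail:
            depth = blk(depth)
        return rgb, depth


def add_depth_branch(blocks: List[Block], num_copied: int) -> BranchedBackbone:
    """Split off the last `num_copied` blocks and deep-copy them as a depth tail."""
    if not 0 < num_copied <= len(blocks):
        raise ValueError("num_copied must be between 1 and len(blocks)")
    split = len(blocks) - num_copied
    tail = blocks[split:]
    # deepcopy gives the depth tail its own parameters, initialized from the
    # pretrained values, so it can later diverge from the RGB tail.
    return BranchedBackbone(blocks[:split], tail, copy.deepcopy(tail))


# Hypothetical "pretrained" stack of four blocks; copy the last two.
pretrained = [Block(2.0), Block(0.5), Block(3.0), Block(1.0, 1.0)]
model = add_depth_branch(pretrained, num_copied=2)

rgb, depth = model([1.0, 2.0])   # identical at initialization

# Changing a depth-tail parameter leaves the RGB output untouched.
model.depth_tail[0].scale = 0.0
rgb_after, depth_after = model([1.0, 2.0])
```

Because the depth tail starts as an exact copy, both outputs agree until the copied parameters are changed, which is one plausible reading of why the abstract calls the adaptation lightweight: only the duplicated blocks add parameters, while the shared trunk is reused.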
