Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

Jun Guo; Qiwei Li; Peiyan Li; Zilong Chen; Nan Sun; Yifei Su; Heyun Wang; Yuan Zhang; Xinghang Li; Huaping Liu

arXiv:2604.26694·cs.RO·May 8, 2026

TL;DR

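X-WAM is a unified 4D world model that combines real-time robotic action execution with high-fidelity 4D world synthesis (video plus 3D reconstruction). It builds on the visual priors of pretrained video diffusion models and uses Asynchronous Noise Sampling (ANS) to keep action decoding efficient.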

Contribution

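The paper introduces X-WAM, a framework that integrates 4D world modeling with action execution. It leverages pretrained video diffusion models, adds a depth prediction branch replicated from the final few blocks of the Diffusion Transformer, and uses Asynchronous Noise Sampling (ANS) to balance generation quality against action decoding efficiency.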

Findings

01. Achieves 79.2% success on the RoboCasa benchmark

02. Achieves 90.7% success on the RoboTwin 2.0 benchmark

03. Produces superior 4D reconstruction and generation results

Abstract

We propose X-WAM, a unified 4D world model that combines real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework. It addresses two critical limitations of prior unified world models (e.g., UWM): they model only 2D pixel space, and they fail to balance action efficiency against world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos. It obtains spatial information efficiently through a lightweight structural adaptation: the final few blocks of the pretrained Diffusion Transformer are replicated into a dedicated depth prediction branch that reconstructs future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a…
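The structural adaptation described in the abstract (replicating the final few Diffusion Transformer blocks into a depth prediction branch) can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation. `ToyBlock`, `TwoBranchBackbone`, and `num_branch_blocks` are hypothetical names, the blocks are trivial affine maps standing in for transformer blocks, and the assumption that both branches read the same shared trunk output is ours, since the abstract does not spell out the wiring.

```python
import copy
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyBlock:
    """Hypothetical stand-in for one transformer block: y = scale * x + shift."""
    scale: float
    shift: float

    def __call__(self, x: List[float]) -> List[float]:
        return [self.scale * v + self.shift for v in x]


class TwoBranchBackbone:
    """Shared trunk plus two heads; the depth head starts as a copy of the RGB head."""

    def __init__(self, blocks: List[ToyBlock], num_branch_blocks: int) -> None:
        if not 0 < num_branch_blocks < len(blocks):
            raise ValueError("num_branch_blocks must be in (0, len(blocks))")
        split = len(blocks) - num_branch_blocks
        self.trunk = blocks[:split]          # shared early blocks
        self.rgb_head = blocks[split:]       # original final blocks
        # Deep copy: the depth branch is initialised from the pretrained
        # weights but owns independent parameters from then on.
        self.depth_head = copy.deepcopy(self.rgb_head)

    def forward(self, x: List[float]) -> Tuple[List[float], List[float]]:
        h = x
        for block in self.trunk:
            h = block(h)
        rgb, depth = h, h                    # assumption: both heads read the trunk output
        for block in self.rgb_head:
            rgb = block(rgb)
        for block in self.depth_head:
            depth = block(depth)
        return rgb, depth


# Usage: four identical blocks, the last two replicated into the depth branch.
model = TwoBranchBackbone([ToyBlock(1.0, 1.0) for _ in range(4)], num_branch_blocks=2)
rgb_out, depth_out = model.forward([0.0])
```

Because the depth head is a deep copy, updating it (for example during fine-tuning on depth targets) leaves the original final blocks untouched, while the trunk computation is paid for only once per forward pass.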
