Embody4D: A Generalist 4D World Model for Embodied AI

Peiyan Tu; Hanxin Zhu; Jingwen Sun; Shaojie Ren; Cong Wang; Jiayi Luo; Xiaoqian Cheng; Zhibo Chen

arXiv:2605.01799 · cs.CV · May 5, 2026

TL;DR

Embody4D is a 4D world model for embodied AI that synthesizes consistent multi-view videos from a monocular video, addressing multi-view data scarcity and hallucinated manipulation details to improve robotic planning.

Contribution

The paper presents Embody4D, a comprehensive 4D world model with a new data synthesis pipeline, adaptive regularization, and interaction-aware attention, advancing embodied spatial reasoning.

Findings

1. Achieves state-of-the-art view synthesis quality.
2. Demonstrates improved robotic planning and learning.
3. Ensures spatiotemporal consistency in generated videos.

Abstract

World models have made significant progress in modeling dynamic environments; however, most embodied world models are still restricted to 2D representations and lack the comprehensive multi-view information essential for embodied spatial reasoning. Bridging this gap is non-trivial, primarily because of the severe scarcity of paired multi-view data, the difficulty of maintaining spatiotemporal consistency in generated 3D geometries, and the tendency to hallucinate manipulation details. To address these challenges, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios that can synthesize arbitrary novel views from a monocular video. First, to tackle data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, guaranteeing broad…
