Embody4D: A Generalist 4D World Model for Embodied AI
Peiyan Tu, Hanxin Zhu, Jingwen Sun, Shaojie Ren, Cong Wang, Jiayi Luo, Xiaoqian Cheng, Zhibo Chen

TL;DR
Embody4D is a video-to-video 4D world model for embodied AI that synthesizes consistent novel-view videos from a single monocular video. It targets the scarcity of paired multi-view data and the tendency to hallucinate manipulation details, with the aim of improving robotic planning.
Contribution
The paper presents Embody4D, a comprehensive 4D world model with a new data synthesis pipeline, adaptive regularization, and interaction-aware attention, advancing embodied spatial reasoning.
Findings
Achieves state-of-the-art view synthesis quality.
Demonstrates improved robotic planning and learning.
Ensures spatiotemporal consistency in generated videos.
Abstract
World models have made significant progress in modeling dynamic environments; however, most embodied world models are still restricted to 2D representations, lacking the comprehensive multi-view information essential for embodied spatial reasoning. Bridging this gap is non-trivial, primarily due to the severe scarcity of paired multi-view data, the difficulty of maintaining spatiotemporal consistency in generated 3D geometries, and the tendency to hallucinate manipulation details. To address these challenges, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios, capable of synthesizing arbitrary novel views from a monocular video. First, to tackle data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, guaranteeing broad…
