Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos
Taiyi Su, Jian Zhu, Yaxuan Li, Chong Ma, Jianjun Zhang, Zitai Huang, Hanli Wang, Yi Xu

TL;DR
This paper introduces MTV-World, a multi-view embodied world model that improves visuomotor prediction and control accuracy by using multi-view trajectory videos as control signals, which compensates for the spatial information lost when 3D actions are projected onto a single 2D view. It also introduces an auto-evaluation pipeline.
Contribution
It proposes a novel multi-view framework for embodied world modeling that improves spatial consistency and control precision in complex robotic scenarios.
Findings
MTV-World achieves high control accuracy in dual-arm robotic tasks.
The multi-view approach compensates for the spatial information lost when 3D actions are projected onto 2D trajectory images.
The auto-evaluation pipeline effectively measures control and interaction accuracy.
Abstract
Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low-level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real-world physical interactions. To address these limitations, we propose MTV-World, an embodied world model that introduces Multi-view Trajectory-Video control for precise visuomotor prediction. Specifically, instead of directly using low-level actions for control, we employ trajectory videos obtained through camera intrinsic and extrinsic parameters and Cartesian-space transformation as control signals. However, projecting 3D raw actions onto 2D images inevitably causes a loss of spatial information, making a single view insufficient for accurate interaction modeling. To overcome this, we…
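To make the projection step concrete, the following is a minimal sketch of a standard pinhole projection, not the authors' implementation. It assumes end-effector positions are already available in Cartesian world coordinates. The function name `project_points`, the intrinsic matrix `K`, and both camera poses are hypothetical values chosen for illustration. It shows two distinct 3D points that fall on the same pixel in one view but separate in a second view, which is the kind of single-view ambiguity the abstract describes.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project Nx3 world-frame points to Nx2 pixel coordinates (pinhole model).

    R and t are the extrinsics mapping world to camera: X_cam = R @ X_world + t.
    K is the 3x3 intrinsic matrix.
    """
    points_cam = points_world @ R.T + t      # world frame -> camera frame
    uvw = points_cam @ K.T                   # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]          # perspective divide by depth

# Hypothetical intrinsics, shared by both cameras for simplicity.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Camera A (assumed): at the world origin, looking along +Z.
R_a, t_a = np.eye(3), np.zeros(3)

# Camera B (assumed): centred at world (2, 0, 2), looking along -X.
R_b = np.array([[ 0.0, 0.0, 1.0],
                [ 0.0, 1.0, 0.0],
                [-1.0, 0.0, 0.0]])
t_b = -R_b @ np.array([2.0, 0.0, 2.0])

# Two different 3D positions lying on the same viewing ray of camera A.
points = np.array([[0.1, 0.0, 1.0],
                   [0.2, 0.0, 2.0]])

uv_a = project_points(points, K, R_a, t_a)
uv_b = project_points(points, K, R_b, t_b)

print("view A:", uv_a)  # both rows coincide: depth along the ray is lost
print("view B:", uv_b)  # rows differ: the second view disambiguates them
```

Drawing the projected pixels of successive end-effector positions onto blank frames, one frame per time step and one sequence per camera, would give a simple form of trajectory video under these assumptions.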
