RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization
Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, Xiu Li

TL;DR
RoboStereo is a symmetric dual-tower 4D world model for embodied AI. It serves as a high-fidelity simulator for a unified policy optimization framework and delivers over 97% relative improvement on manipulation tasks.
Contribution
The paper presents RoboStereo, a symmetric dual-tower 4D world model with bidirectional cross-modal enhancement and a unified policy optimization framework for embodied AI.
Findings
Achieves state-of-the-art generation quality in 4D simulation.
Delivers over 97% relative improvement on manipulation tasks.
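Introduces a unified policy optimization framework combining Test-Time Policy Augmentation (TTPA), Imitative-Evolutionary Policy Learning (IEPL), and Open-Exploration Policy Learning (OEPL).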
Abstract
Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning…
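To make the idea of pre-execution verification concrete, the following is a minimal sketch of how a TTPA-style selection loop could look. The abstract does not say how candidates are proposed, rolled out, or scored, so this uses a generic best-of-N scheme as a stand-in. The names `ttpa_select`, `propose`, `imagine`, `score`, `toy_imagine`, and `toy_score` are hypothetical, and the toy 1D dynamics only stand in for a learned world model.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[float]      # a sequence of scalar actions (toy assumption)
Rollout = List[float]   # the imagined sequence of scalar states (toy assumption)


def ttpa_select(
    state: float,
    propose: Callable[[float], Plan],
    imagine: Callable[[float, Plan], Rollout],
    score: Callable[[Rollout], float],
    num_candidates: int = 8,
) -> Tuple[Optional[Plan], float]:
    """Return the candidate plan whose imagined rollout scores highest.

    Hypothetical best-of-N sketch: each candidate is checked inside the
    world model before anything is executed on the real robot.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    best_plan: Optional[Plan] = None
    best_score = float("-inf")
    for _ in range(num_candidates):
        plan = propose(state)            # the policy suggests an action sequence
        rollout = imagine(state, plan)   # the world model predicts the outcome
        value = score(rollout)           # a verifier rates the imagined outcome
        if value > best_score:
            best_plan, best_score = plan, value
    return best_plan, best_score


def toy_imagine(state: float, plan: Plan) -> Rollout:
    """Toy additive dynamics standing in for a learned 4D world model."""
    states = [state]
    for action in plan:
        states.append(states[-1] + action)
    return states


def toy_score(rollout: Rollout, goal: float = 1.0) -> float:
    """Toy verifier: negative distance between the final state and the goal."""
    return -abs(rollout[-1] - goal)
```

In this sketch only the highest-scoring plan would be passed on for execution. A real system would replace `toy_imagine` with rollouts from the world model and `toy_score` with whatever verification signal the method actually uses.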
