RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

Ruicheng Zhang; Guangyu Chen; Zunnan Xu; Zihao Liu; Zhizhou Zhong; Mingyang Zhang; Jun Zhou; Xiu Li

arXiv:2603.12639 · cs.CV · April 14, 2026

TL;DR

RoboStereo introduces a dual-tower 4D world model for embodied AI, enabling unified policy optimization through high-fidelity simulation and novel learning frameworks, significantly improving manipulation task performance.

Contribution

The paper presents RoboStereo, a symmetric dual-tower 4D world model with bidirectional cross-modal enhancement and a unified policy optimization framework for embodied AI.

Findings

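01. Achieves state-of-the-art generation quality in 4D simulation.

02. Delivers over 97% relative improvement on manipulation tasks.

03. Introduces a unified policy optimization framework combining Test-Time Policy Augmentation (TTPA), Imitative-Evolutionary Policy Learning (IEPL), and Open-Exploration Policy Learning (OEPL).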
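For readers unfamiliar with the metric in finding 02, "relative improvement" is conventionally the gain over a baseline divided by that baseline. The sketch below shows the arithmetic only. The function name and the success rates are hypothetical and are not taken from the paper, which does not state its baseline in this summary.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical success rates in percent (illustrative only, not from the paper):
# a policy going from 40% to 80% success has doubled, i.e. a 100% relative gain.
example = relative_improvement(40.0, 80.0)
print(f"{example:.1f}% relative improvement")
```

Under this convention, a relative improvement above 97% means the optimized policy scores nearly twice the baseline value of whatever metric is being compared.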

Abstract

Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning…
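The abstract describes TTPA only as "pre-execution verification", so the following is a minimal sketch of the general idea of checking candidate action sequences in imagination before acting, not the paper's actual procedure. Every name here (`toy_world_model`, `imagined_score`, `select_plan`) and the one-dimensional dynamics are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

State = float
Action = float
WorldModel = Callable[[State, Action], State]


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned world model: predicts the next state."""
    return state + action


def imagined_score(state: State, plan: Sequence[Action],
                   model: WorldModel, goal: State) -> float:
    """Roll a candidate plan out in the model and score it by
    negative distance between the predicted final state and the goal."""
    for action in plan:
        state = model(state, action)
    return -abs(goal - state)


def select_plan(state: State, candidates: List[List[Action]],
                model: WorldModel, goal: State) -> List[Action]:
    """Pick the candidate whose imagined rollout scores best,
    so only a plan verified in simulation is executed."""
    return max(candidates, key=lambda plan: imagined_score(state, plan, model, goal))


# Three hypothetical candidate plans proposed by a policy.
candidates = [[0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
best = select_plan(0.0, candidates, toy_world_model, goal=2.0)
print(best)
```

In this toy setting the second plan is selected because its imagined final state lands exactly on the goal. A real system would replace the scalar dynamics with the learned 4D simulator and the distance score with a task-specific success estimate.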
