Stereo World Model: Camera-Guided Stereo Video Generation
Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi

TL;DR
StereoWorld is a camera-conditioned stereo video generation model that produces consistent stereo video directly in the RGB modality. It is reported to generate stereo video over 3x faster than monocular pipelines, with better stereo consistency and disparity accuracy.
Contribution
Introduces a camera-conditioned stereo world model that combines a unified camera-frame RoPE (camera-aware rotary positional encoding) with a stereo-aware attention decomposition for efficient, consistent stereo video synthesis directly from RGB inputs.
Findings
Improves stereo consistency and disparity accuracy over monocular pipelines.
Achieves over 3x faster stereo video generation with better viewpoint consistency.
Enables end-to-end stereo VR rendering without depth estimation or inpainting.
Abstract
We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower…
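The attention decomposition in design (2) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the tensor layout (view, time, height, width, channels), the function names, the absence of learned projections, and the order in which the two attention passes are applied are all assumptions made for illustration. It relies only on the standard fact that, for a rectified stereo pair, corresponding points lie on the same image row, so cross-view attention can be restricted to tokens sharing a row.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention over the second-to-last axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def intra_view_attention(x):
    # x: (V, T, H, W, C). Each view attends over its own T*H*W tokens only.
    V, T, H, W, C = x.shape
    tokens = x.reshape(V, T * H * W, C)
    return attention(tokens, tokens, tokens).reshape(V, T, H, W, C)

def row_attention(x):
    # Tokens on the same (t, h) row attend to each other across all views,
    # i.e. over V*W tokens per row (hypothetical layout, not from the paper).
    V, T, H, W, C = x.shape
    rows = x.transpose(1, 2, 0, 3, 4).reshape(T, H, V * W, C)
    out = attention(rows, rows, rows)
    return out.reshape(T, H, V, W, C).transpose(2, 0, 1, 3, 4)

def decomposed_attention(x):
    # Assumed ordering: intra-view pass first, then cross-view row pass.
    return row_attention(intra_view_attention(x))

def pair_counts(V, T, H, W):
    # Number of query-key pairs scored by full attention vs. the decomposition.
    full = (V * T * H * W) ** 2
    decomposed = V * (T * H * W) ** 2 + T * H * (V * W) ** 2
    return full, decomposed
```

Under these assumptions, calling `pair_counts(2, 4, 8, 8)` compares the number of query-key pairs scored by full attention over both views with the number scored by the two restricted passes. The saving comes from dropping cross-view pairs that do not share a row and a frame.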
Taxonomy
Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Advanced Optical Imaging Technologies
