RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space
Yichen Xie, Chensheng Peng, Mazen Abdelfattah, Yihan Hu, Jiezhi Yang, Eric Higgins, Ryan Brigden, Masayoshi Tomizuka, Wei Zhan

TL;DR
RAYNOVA is a novel world modeling framework that employs a dual-causal autoregressive approach with global attention, enabling robust, scalable, and controllable multi-view video generation in driving scenarios without explicit 3D scene priors.
Contribution
It introduces a scale-temporal autoregressive world model with a unified 4D reasoning framework and relative Plücker-ray encoding, improving generalization and efficiency in multi-view video synthesis.
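For readers unfamiliar with Plücker rays, the following is a minimal sketch of the standard per-pixel Plücker coordinates (unit direction plus moment, origin × direction) for a pinhole camera. It is not the authors' implementation: the function name `plucker_rays`, the intrinsics `K`, and the pose `cam_to_world` are illustrative assumptions, and how RAYNOVA makes the encoding relative across views and frames is not described in this summary.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Return an (H, W, 6) array of Plücker coordinates (direction, moment).

    Hypothetical helper for illustration only. K is a 3x3 pinhole intrinsics
    matrix and cam_to_world a 4x4 pose; both are assumed inputs.
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project pixels to ray directions in the camera frame.
    dirs_cam = pix @ np.linalg.inv(K).T

    # Rotate into the reference frame and normalize to unit length.
    R, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Moment of each ray: camera origin crossed with the ray direction.
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1)
```

Passing a pose expressed relative to a chosen reference camera, rather than a world pose, is one simple way to obtain coordinates that do not depend on the global frame; whether the paper does this is an assumption.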
Findings
Achieves state-of-the-art results on nuScenes multi-view video generation.
Demonstrates robust generalization to new views and camera setups.
Offers higher throughput and controllability compared to existing methods.
Abstract
World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-agnostic multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation.…
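One possible reading of the "dual-causal" ordering can be sketched as an attention mask. The sketch below assumes, without confirmation from the paper, that a token at frame t and scale s may attend to every token at frame t' ≤ t and scale s' ≤ s, with full attention inside each (frame, scale) block. The function name `dual_causal_mask` and the `tokens_per_scale` layout are hypothetical.

```python
import numpy as np

def dual_causal_mask(num_frames, tokens_per_scale):
    """Boolean (N, N) mask; entry [i, j] is True if query i may attend to key j.

    Assumed rule (not taken from the paper): key j is visible to query i
    when frame_j <= frame_i and scale_j <= scale_i.
    """
    frame_ids, scale_ids = [], []
    # Lay tokens out frame by frame, coarse scale to fine scale.
    for t in range(num_frames):
        for s, n in enumerate(tokens_per_scale):
            frame_ids += [t] * n
            scale_ids += [s] * n
    f = np.array(frame_ids)
    s = np.array(scale_ids)

    # Causal in time and causal in scale; combine both constraints.
    return (f[None, :] <= f[:, None]) & (s[None, :] <= s[:, None])
```

For example, `dual_causal_mask(2, [1, 4])` describes two frames, each with one coarse token and four finer tokens. Under this assumed rule, the coarse token of the second frame sees only the coarse tokens of both frames, while the finest tokens of the last frame see everything.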
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
