RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space

Yichen Xie; Chensheng Peng; Mazen Abdelfattah; Yihan Hu; Jiezhi Yang; Eric Higgins; Ryan Brigden; Masayoshi Tomizuka; Wei Zhan

arXiv:2602.20685·cs.CV·February 26, 2026

Open Access

TL;DR

RAYNOVA is a novel world modeling framework that employs a dual-causal autoregressive approach with global attention, enabling robust, scalable, and controllable multi-view video generation in driving scenarios without explicit 3D scene priors.

Contribution

It introduces a scale-temporal autoregressive world model with a unified 4D reasoning framework and relative Plücker-ray encoding, improving generalization and efficiency in multi-view video synthesis.

Findings

01. Achieves state-of-the-art results on nuScenes multi-view video generation.

02. Demonstrates robust generalization to new views and camera setups.

03. Offers higher throughput and controllability compared to existing methods.

Abstract

World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-agnostic multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation.…
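
To make the Plücker-ray idea in the abstract concrete, here is a minimal sketch of the standard per-pixel Plücker coordinates (unit direction d and moment o × d) for a pinhole camera. It is not the paper's implementation: the function name `plucker_rays`, the pinhole model, the half-pixel offset, and the NumPy dependency are all assumptions made for illustration, and the abstract does not specify how RAYNOVA turns these coordinates into a relative positional encoding.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Return an (H, W, 6) array of Plücker coordinates (d, o x d) per pixel.

    Hypothetical helper for illustration only.
    K:            (3, 3) pinhole intrinsics.
    cam_to_world: (4, 4) camera pose; the target frame can be any reference frame.
    """
    # Pixel centers in homogeneous image coordinates (half-pixel offset assumed).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)       # (H, W, 3)

    # Back-project pixels to ray directions in the camera frame.
    dirs_cam = pix @ np.linalg.inv(K).T                     # (H, W, 3)

    # Rotate into the target frame and normalize to unit length.
    rotation = cam_to_world[:3, :3]
    origin = cam_to_world[:3, 3]
    d = dirs_cam @ rotation.T
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)

    # Moment vector m = o x d; it is unchanged if o slides along the ray.
    m = np.cross(np.broadcast_to(origin, d.shape), d)
    return np.concatenate([d, m], axis=-1)                  # (H, W, 6)
```

One way to obtain rays relative to a reference camera, again as an assumption rather than the authors' method, is to pass `np.linalg.inv(ref_cam_to_world) @ cam_to_world` as the pose, so that every view is expressed in the reference camera's frame instead of a fixed world frame.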

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis