Epona: Autoregressive Diffusion World Model for Autonomous Driving
Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin

TL;DR
Epona introduces an autoregressive diffusion model for autonomous driving that enables long-horizon, high-resolution world modeling and integrates motion planning, achieving state-of-the-art results in video prediction and real-time planning.
Contribution
The paper proposes a novel autoregressive diffusion framework with decoupled spatiotemporal modeling and modular prediction, advancing long-duration, flexible autonomous driving world modeling.
Findings
7.4% FVD improvement over prior methods
Longer prediction durations in autonomous driving scenarios
Outperforms end-to-end planners on NAVSIM benchmarks
Abstract
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in…
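The contrast the abstract draws between global and localized modeling can be written out with the chain rule. The notation below is assumed for illustration and is not taken from the paper: x_t stands for the frame (or its latent) at timestep t, and T for the sequence length. A fixed-length video diffusion model targets the joint on the left for a preset T, whereas an autoregressive model builds it from the per-step conditionals on the right, so a rollout can stop or continue at any step.

```latex
% Illustrative notation (assumed, not from the paper):
% x_t = frame or latent at timestep t, T = sequence length
p(x_{1:T}) \;=\; \prod_{t=1}^{T} p\left(x_t \mid x_{1:t-1}\right)
```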
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Advanced Vision and Imaging
Methods: Diffusion
