Epona: Autoregressive Diffusion World Model for Autonomous Driving

Kaiwen Zhang; Zhenyu Tang; Xiaotao Hu; Xingang Pan; Xiaoyang Guo; Yuan Liu; Jingwei Huang; Li Yuan; Qian Zhang; Xiao-Xiao Long; Xun Cao; Wei Yin

arXiv:2506.24113 · cs.CV · July 1, 2025

TL;DR

Epona introduces an autoregressive diffusion model for autonomous driving that enables long-horizon, high-resolution world modeling and integrates motion planning, achieving state-of-the-art results in video prediction and real-time planning.

Contribution

The paper proposes a novel autoregressive diffusion framework with decoupled spatiotemporal modeling and modular prediction, advancing long-duration, flexible autonomous driving world modeling.

Findings

01  7.4% FVD improvement over prior methods
02  Longer prediction durations in autonomous driving scenarios
03  Outperforms end-to-end planners on NAVSIM benchmarks

Abstract

Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in…
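The contrast the abstract draws, between modeling a fixed-length clip jointly and constructing a distribution one timestep at a time, can be illustrated with a toy rollout loop. The sketch below is not Epona's implementation: `Frame`, `toy_next_frame`, and `rollout` are hypothetical names, and the sampler is a stand-in that only mimics the shape of a conditional denoising loop (start from noise, refine toward something determined by the context). What it shows is why a per-step formulation allows any horizon: each new frame depends only on a sliding window of earlier frames, so the loop can run for as many steps as requested.

```python
import random
from collections import deque
from typing import Callable, List

Frame = List[float]  # hypothetical stand-in for an image latent


def toy_next_frame(context: List[Frame], rng: random.Random, steps: int = 4) -> Frame:
    """Stand-in for a conditional sampler; NOT a real diffusion model.

    Starts from Gaussian noise and moves it halfway toward the mean of the
    context frames on each of `steps` refinement iterations.
    """
    dim = len(context[0])
    target = [sum(f[i] for f in context) / len(context) for i in range(dim)]
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(steps):
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


def rollout(
    history: List[Frame],
    horizon: int,
    context_len: int,
    sampler: Callable[[List[Frame], random.Random], Frame],
    seed: int = 0,
) -> List[Frame]:
    """Generate `horizon` frames one at a time, each conditioned on the
    most recent `context_len` frames (observed or previously generated)."""
    rng = random.Random(seed)
    window = deque(history[-context_len:], maxlen=context_len)
    generated: List[Frame] = []
    for _ in range(horizon):
        frame = sampler(list(window), rng)
        generated.append(frame)
        window.append(frame)  # oldest frame drops out automatically
    return generated


if __name__ == "__main__":
    observed = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
    short = rollout(observed, horizon=5, context_len=2, sampler=toy_next_frame)
    long = rollout(observed, horizon=50, context_len=2, sampler=toy_next_frame)
    print(len(short), len(long))
```

The horizon is a runtime argument rather than a property of the model, which is the flexibility a fixed-length joint model lacks. In a real system the stand-in sampler would be replaced by a learned conditional denoiser.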

Code & Models

Repositories

kevin-thu/epona (PyTorch, official)

Models

Kevin-thu/Epona (Hugging Face model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Advanced Vision and Imaging

Methods: Diffusion