FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving
Yaoru Li, Federico Landi, Marco Godi, Xin Jin, Ruiju Fu, Yufei Ma, Muyang Sun, Heyu Si, Qi Guo

TL;DR
FAR-Drive is a frame-level autoregressive video generation framework for autonomous driving simulation. It targets long-horizon consistency, multi-view generation, and low-latency inference, with the aim of enabling more reliable and interactive training environments.
Contribution
The paper presents a multi-view diffusion transformer with structured control and a two-stage training strategy to improve closed-loop autonomous driving simulation.
Findings
Achieves state-of-the-art performance on the nuScenes dataset
Maintains sub-second latency on a single GPU
Improves long-horizon consistency and robustness in closed-loop simulation
Abstract
Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control,…
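The closed-loop, frame-level interaction described in the abstract can be illustrated with a minimal sketch. This is not the FAR-Drive implementation: `closed_loop_rollout`, `generate_frame`, `policy`, `context_len`, and the toy stand-ins are hypothetical names introduced here, and frames are plain lists of floats rather than multi-view images. The sketch only shows the control flow in which each generated frame is fed back as conditioning and the agent's action can change at every frame, which is also where autoregressive degradation can accumulate.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]  # hypothetical stand-in for a multi-view image tensor


def closed_loop_rollout(
    generate_frame: Callable[[Sequence[Frame], float], Frame],
    policy: Callable[[Frame], float],
    first_frame: Frame,
    num_steps: int,
    context_len: int = 4,
) -> List[Frame]:
    """Alternate between agent action and frame generation, one frame at a time."""
    # Sliding window of past frames; the model conditions on its own outputs.
    context: Deque[Frame] = deque([first_frame], maxlen=context_len)
    frames = [first_frame]
    for _ in range(num_steps):
        action = policy(context[-1])                         # agent reacts to the latest frame
        next_frame = generate_frame(list(context), action)   # simulator produces the next frame
        context.append(next_frame)                           # self-conditioning step
        frames.append(next_frame)
    return frames


def toy_generator(context: Sequence[Frame], action: float) -> Frame:
    # Toy dynamics (assumption): shift every value of the last frame by the action.
    return [x + action for x in context[-1]]


def toy_policy(frame: Frame) -> float:
    # Toy controller (assumption): advance until the first value reaches 3.0.
    return 1.0 if frame[0] < 3.0 else 0.0


rollout = closed_loop_rollout(toy_generator, toy_policy, [0.0, 0.0], num_steps=5)
```

In an open-loop setting, by contrast, the whole action sequence would be fixed before generation starts, so the policy would never see the generated frames.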
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Advanced Vision and Imaging
