ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving
Can Cui, Yupeng Zhou, Juntong Peng, Sung-Yeon Park, Zichong Yang, Prashanth Sankaranarayanan, Jiaru Zhang, Ruqi Zhang, Ziran Wang

TL;DR
ViLaD introduces a diffusion-based vision language framework for autonomous driving that enables faster, bidirectional decision-making, outperforming traditional autoregressive models in accuracy and speed, and demonstrating real-world applicability.
Contribution
The paper presents ViLaD, a novel diffusion model architecture for autonomous driving that reduces latency and enables bidirectional reasoning, addressing limitations of autoregressive models.
Findings
Outperforms state-of-the-art autoregressive models in planning accuracy.
Achieves significantly faster inference than autoregressive models.
Demonstrates practical deployment on an autonomous vehicle.
Abstract
End-to-end autonomous driving systems built on Vision Language Models (VLMs) have shown significant promise, yet their reliance on autoregressive architectures introduces limitations for real-world applications. The sequential, token-by-token generation process of these models results in high inference latency and precludes bidirectional reasoning, making them unsuitable for dynamic, safety-critical environments. To overcome these challenges, we introduce ViLaD, a novel Large Vision Language Diffusion (LVLD) framework for end-to-end autonomous driving that represents a paradigm shift. ViLaD leverages a masked diffusion model that enables parallel generation of entire driving decision sequences, significantly reducing computational latency. Moreover, its architecture supports bidirectional reasoning, allowing the model to consider both past and future simultaneously, and…
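To make the contrast between token-by-token and parallel generation concrete, the following is a toy sketch and not ViLaD's actual implementation. It assumes a hypothetical `toy_predictor` that stands in for the network and returns a random token and confidence per position. The confidence-based unmasking schedule is also an assumption, chosen because it is a common way to decode masked diffusion models; the source does not specify it. The point is only the number of model calls: the masked decoder fills every position in a fixed number of parallel steps while seeing the whole sequence, whereas the autoregressive decoder needs one call per token and sees only the prefix.

```python
import random

MASK = -1  # placeholder id for a position that has not been decoded yet


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for a model: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_diffusion_decode(length, steps, vocab_size=16, seed=0):
    """Start fully masked; each step predicts all positions at once and
    keeps the most confident predictions among the still-masked positions."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    calls = 0
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One call sees the full sequence, i.e. context on both sides.
        preds = toy_predictor(tokens, vocab_size, rng)
        calls += 1
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens, calls


def autoregressive_decode(length, vocab_size=16, seed=0):
    """Generate left to right; each call sees only the prefix decoded so far."""
    rng = random.Random(seed)
    tokens = []
    calls = 0
    for _ in range(length):
        preds = toy_predictor(tokens + [MASK], vocab_size, rng)
        calls += 1
        tokens.append(preds[-1][0])  # keep only the prediction for the next slot
    return tokens, calls


if __name__ == "__main__":
    _, diffusion_calls = masked_diffusion_decode(length=12, steps=4)
    _, ar_calls = autoregressive_decode(length=12)
    print("masked diffusion model calls:", diffusion_calls)
    print("autoregressive model calls:", ar_calls)
```

In this sketch the number of model calls for the masked decoder is bounded by `steps` regardless of sequence length, which is the property the abstract credits for the latency reduction. Real latency also depends on per-call cost, which this toy example does not model.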
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Robotic Path Planning Algorithms
