ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving

Can Cui; Yupeng Zhou; Juntong Peng; Sung-Yeon Park; Zichong Yang; Prashanth Sankaranarayanan; Jiaru Zhang; Ruqi Zhang; Ziran Wang

arXiv:2508.12603·cs.CV·August 19, 2025

Open Access

TL;DR

ViLaD introduces a diffusion-based vision language framework for autonomous driving that enables faster, bidirectional decision-making, outperforming traditional autoregressive models in accuracy and speed, and demonstrating real-world applicability.

Contribution

The paper presents ViLaD, a novel diffusion model architecture for autonomous driving that reduces latency and enables bidirectional reasoning, addressing limitations of autoregressive models.

Findings

01. Outperforms state-of-the-art autoregressive models in planning accuracy.

02. Achieves significantly faster inference than autoregressive baselines.

03. Demonstrates practical deployment on an autonomous vehicle.

Abstract

End-to-end autonomous driving systems built on Vision Language Models (VLMs) have shown significant promise, yet their reliance on autoregressive architectures limits their use in real-world applications. The sequential, token-by-token generation process of these models incurs high inference latency and precludes bidirectional reasoning, making them unsuitable for dynamic, safety-critical environments. To overcome these challenges, we introduce ViLaD, a novel Large Vision Language Diffusion (LVLD) framework for end-to-end autonomous driving. ViLaD leverages a masked diffusion model that generates entire driving decision sequences in parallel, significantly reducing computational latency. Moreover, its architecture supports bidirectional reasoning, allowing the model to consider both past and future simultaneously, and…
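To make the contrast with token-by-token generation concrete, here is a minimal sketch of how a masked diffusion decoder can fill a whole sequence in a few parallel steps. It is not the authors' implementation: `toy_denoiser` is a hypothetical random stand-in for the model, `MASK`, `vocab_size`, and `steps` are illustrative names, and the confidence-based unmasking schedule is an assumption borrowed from common masked-decoding practice rather than something the abstract specifies.

```python
import random

MASK = -1  # hypothetical id for a masked (not yet decided) position


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the model (assumption): for every position, return a
    (predicted_token, confidence) pair. A real model would condition on
    all positions at once, which is what allows bidirectional context."""
    preds = []
    for t in tokens:
        if t != MASK:
            preds.append((t, 1.0))  # already committed, keep as-is
        else:
            preds.append((rng.randrange(vocab_size), rng.random()))
    return preds


def masked_diffusion_decode(length, steps, vocab_size=16, seed=0):
    """Start fully masked; at each step predict all positions in parallel
    and commit only the most confident of the still-masked ones."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens, vocab_size, rng)
        # Spread the remaining masked positions evenly over remaining steps
        # (ceil division), so the final step always commits everything left.
        k = -(-len(masked) // (steps - step))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    # 8 positions decoded in 4 model calls instead of 8 sequential ones.
    print(masked_diffusion_decode(length=8, steps=4))
```

In this sketch the number of model calls is set by `steps` rather than by sequence length, which is the source of the latency reduction that parallel generation is meant to provide.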

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Robotic Path Planning Algorithms