VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers

Ziang Guo; Konstantin Gubernatorov; Selamawit Asfaw; Zakhar Yagudin; Dzmitry Tsetserukou

arXiv:2502.20108·cs.CV·March 4, 2025

Open Access

TL;DR

VDT-Auto is an end-to-end autonomous driving pipeline that uses a Visual Language Model (VLM) to condition a diffusion Transformer for action generation, reporting a 0.52 m average L2 planning error and a 21% collision rate.

Contribution

The paper presents a novel integration of a VLM with a diffusion Transformer for end-to-end autonomous driving, aimed at improving robustness in dynamic environments and corner cases.

Findings

01

Achieved 0.52 m average L2 error in planning

02

Reduced collision rate to 21% in evaluations

03

Demonstrated strong real-world generalizability
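
As a rough illustration of the first finding, the sketch below computes an average L2 error between a planned path and a reference path. The function name `average_l2_error`, the 2D waypoint format, and the sample values are hypothetical; the paper's exact evaluation protocol (horizons, averaging scheme) is not given on this page, so this is only one plausible reading of the metric.

```python
import math

def average_l2_error(predicted, ground_truth):
    """Mean Euclidean distance between matching (x, y) waypoints, in metres.

    Assumes both paths have the same number of waypoints and that
    waypoint i in one path corresponds to waypoint i in the other.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("paths must be non-empty and of equal length")
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances)

# Hypothetical waypoints: the plan drifts laterally from the reference.
planned = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
reference = [(1.0, 0.3), (2.0, 0.4), (3.0, 0.5)]
error_m = average_l2_error(planned, reference)  # about 0.4 m
```

A lower value means the planned waypoints stay closer to the reference on average; it says nothing about collisions, which is why the collision rate is reported separately.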

Abstract

In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's decision-making. To address these challenges, starting from the representation of the state-action mapping in the end-to-end autonomous driving paradigm, we introduce a novel pipeline, VDT-Auto. Combining the state understanding of a Visual Language Model (VLM) with diffusion Transformer-based action generation, VDT-Auto parses the environment geometrically and contextually to condition the diffusion process. Geometrically, we use a bird's-eye view (BEV) encoder to extract feature grids from the surrounding images. Contextually, the structured output of our fine-tuned VLM is processed into textual embeddings and noisy paths. During our diffusion process, the added noise for the forward process is sampled from the noisy…
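
For readers unfamiliar with the forward process the abstract mentions, here is a minimal sketch of the standard DDPM-style closed-form noising step applied to a 2D path. It is not the authors' method: the linear beta schedule, the step count, the Gaussian noise, and all function names are assumptions. The abstract indicates that VDT-Auto draws its forward-process noise from a different source, which is cut off in this excerpt, so the noise is passed in as an argument rather than fixed.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars

def forward_noise(path, t, alpha_bars, noise):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, per coordinate."""
    alpha_bar = alpha_bars[t]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [
        (signal_scale * x + noise_scale * ex, signal_scale * y + noise_scale * ey)
        for (x, y), (ex, ey) in zip(path, noise)
    ]

# Hypothetical clean path and plain Gaussian noise (swap in another noise source here).
rng = random.Random(0)
clean_path = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.3)]
gaussian = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in clean_path]
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

noisy_early = forward_noise(clean_path, 0, alpha_bars, gaussian)   # almost the clean path
noisy_late = forward_noise(clean_path, 999, alpha_bars, gaussian)  # almost pure noise
```

Keeping `noise` as a parameter makes the sketch agnostic to where the noise comes from, which is the part of the pipeline the excerpt does not fully describe.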

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications · Social Robot Interaction and HRI

Methods: Diffusion