VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers
Ziang Guo, Konstantin Gubernatorov, Selamawit Asfaw, Zakhar Yagudin, Dzmitry Tsetserukou

TL;DR
VDT-Auto introduces an end-to-end autonomous driving pipeline that combines a visual language model with a diffusion transformer to improve environment understanding and decision-making, reporting a 0.52 m average L2 planning error and a 21% collision rate.
Contribution
It presents a novel integration of VLM-guided diffusion transformers for end-to-end autonomous driving, enhancing robustness in dynamic and corner case scenarios.
Findings
Achieved 0.52 m average L2 error in planning
Reduced collision rate to 21% in evaluations
Demonstrated strong real-world generalizability
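The average L2 error cited above measures how far planned waypoints fall from the ground-truth trajectory. The snippet below is a minimal sketch of that metric, assuming 2D `(x, y)` waypoints in meters and a plain mean over all waypoints. The function name `average_l2_error` and the sample trajectories are hypothetical, and the paper's exact evaluation protocol (for example, how it averages across planning horizons) may differ.

```python
import math


def average_l2_error(planned, ground_truth):
    """Mean Euclidean distance between matching (x, y) waypoints, in meters."""
    if len(planned) != len(ground_truth) or not planned:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, g) for p, g in zip(planned, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical waypoints, made up for illustration only (not data from the paper).
planned = [(0.0, 1.0), (0.0, 2.0), (0.3, 3.4)]
ground_truth = [(0.0, 1.0), (0.0, 2.5), (0.0, 3.0)]
error = average_l2_error(planned, ground_truth)
```

Here the per-waypoint distances are 0, 0.5, and 0.5 m, so `error` is about 0.33 m.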
Abstract
In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's decision-making. To address these challenges, starting from the representation of state-action mapping in the end-to-end autonomous driving paradigm, we introduce a novel pipeline, VDT-Auto. Leveraging advances in the state understanding of Visual Language Models (VLMs), combined with diffusion Transformer-based action generation, VDT-Auto parses the environment geometrically and contextually to condition the diffusion process. Geometrically, we use a bird's-eye view (BEV) encoder to extract feature grids from the surrounding images. Contextually, the structured output of our fine-tuned VLM is processed into textual embeddings and noisy paths. During our diffusion process, the added noise for the forward process is sampled from the noisy…
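For readers unfamiliar with the forward process mentioned in the abstract, the following is a minimal sketch of the standard DDPM-style forward step applied to a 2D path. It is not the authors' implementation. The abstract excerpt is cut off before it states where the noise comes from, so the sketch takes the noise as an argument instead of assuming a source. The Gaussian noise in the usage lines is the textbook default, and the linear beta schedule, function names, and sample path are all assumptions.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def forward_diffuse(x0, noise, alpha_bar_t):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, per waypoint."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(a * x + b * nx, a * y + b * ny) for (x, y), (nx, ny) in zip(x0, noise)]


# Hypothetical clean path and Gaussian noise, for illustration only.
rng = random.Random(0)
clean_path = [(0.0, 1.0), (0.0, 2.0), (0.1, 3.0)]
noise = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in clean_path]
alpha_bars = linear_alpha_bar(100)
slightly_noised = forward_diffuse(clean_path, noise, alpha_bars[0])
heavily_noised = forward_diffuse(clean_path, noise, alpha_bars[-1])
```

Early timesteps keep the path close to the clean one, while later timesteps shift most of the weight onto the noise term.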
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications · Social Robot Interaction and HRI
Methods: Diffusion
