VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, and Khaled S. Refaat

arXiv:2605.20082·cs.CV·May 20, 2026

TL;DR

VL-DPO finetunes ego-vehicle motion forecasting models on preference pairs generated by a vision-language model, aligning them with human driving preferences. It reports an 11.94% increase in RFS and a 10.01% reduction in ADE.

Contribution

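The paper proposes a framework in which a vision-language model acts as a zero-shot reasoner, automatically generating preference pairs from a pretrained driving model's rollouts; these pairs are then used to finetune the model via Direct Preference Optimization (DPO).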

Findings

01 VL-DPO achieves an 11.94% increase in Rater Feedback Score (RFS).

02 VL-DPO reduces average displacement error (ADE) by 10.01%.

03 VLM-based trajectory selection correlates well with human preferences.

Abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance…
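As an illustration of the finetuning objective named in the abstract, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes the log-probabilities of the preferred and rejected trajectories under the finetuned model and under the frozen pretrained (reference) model are already available. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the paper's actual implementation and hyperparameters are not given in this summary.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are log-probabilities of a full trajectory. The names and
    the default beta are hypothetical, not taken from the paper.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # trajectory over the rejected one, relative to the reference model.
    margin = (policy_logp_chosen - ref_logp_chosen) \
        - (policy_logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: margin is zero, loss equals log(2).
loss_at_init = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy that raises the chosen trajectory and lowers the rejected one
# relative to the reference gets a smaller loss.
loss_aligned = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

In this sketch the preferred and rejected trajectories would come from the VLM's selection among the pretrained model's rollouts; how the pairs are formed in practice is described in the paper itself, not here.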
