VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion

Pei Liu; Haipeng Liu; Haichao Liu; Xin Liu; Jinxin Ni; Jun Ma

arXiv:2502.18042 · cs.CV · September 19, 2025

TL;DR

VLM-E2E introduces a multimodal framework that enhances end-to-end autonomous driving by integrating vision-language models to improve semantic understanding and modality fusion, leading to better perception and decision-making.

Contribution

It is the first to incorporate textual attentional cues into BEV features for autonomous driving, addressing modality imbalance with a learnable fusion strategy.
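
The page does not spell out the fusion mechanism, so the following is only a minimal sketch of what a learnable weighting between BEV features and a text embedding could look like. Everything in it is an assumption made for illustration: the function names `softmax` and `fuse_bev_with_text`, the two-scalar `modality_logits` parameter, and the choice of a softmax-weighted sum. It should not be read as the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_bev_with_text(bev_cells, text_embedding, modality_logits):
    """Blend each BEV cell with a text embedding (hypothetical helper).

    bev_cells:       list of BEV cells, each a list of C channel values
    text_embedding:  list of C values, assumed already projected to C channels
    modality_logits: two scalars [bev_logit, text_logit]; in a trained model
                     these would be learnable parameters (assumption)
    """
    # Softmax keeps the two modality weights positive and summing to 1,
    # so neither modality can be scaled without bound.
    w_bev, w_text = softmax(modality_logits)
    fused = []
    for cell in bev_cells:
        if len(cell) != len(text_embedding):
            raise ValueError("BEV channels and text embedding size must match")
        fused.append([w_bev * b + w_text * t for b, t in zip(cell, text_embedding)])
    return fused


if __name__ == "__main__":
    bev = [[1.0, 3.0], [0.0, 2.0]]   # two BEV cells, two channels each
    text = [3.0, 1.0]                # text embedding with two channels
    print(fuse_bev_with_text(bev, text, [0.0, 0.0]))
```

With equal logits the two modalities contribute equally; raising one logit shifts weight toward that modality, which is one simple way a model could learn to counter modality imbalance.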

Findings

1. Significant improvements in perception accuracy on the nuScenes dataset.
2. Enhanced prediction and planning capabilities over baseline models.
3. Better alignment with human-like driving behavior.

Abstract

Human drivers adeptly navigate complex scenarios by drawing on rich attentional semantics, but current autonomous systems struggle to replicate this ability because they often lose critical semantic information when converting 2D observations into 3D space. This loss hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that uses VLMs to enhance training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, enabling the model to learn richer feature representations that explicitly capture the driver's attentional semantics. By focusing on attentional semantics, VLM-E2E better aligns with human-like driving behavior, which is…

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Human-Automation Interaction and Safety · Advanced Neural Network Applications