VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

Ruifei Zhang; Wei Zhang; Xiao Tan; Sibei Yang; Xiang Wan; Xiaonan Luo; Guanbin Li

arXiv:2511.06256 · cs.CV · November 11, 2025

TL;DR

VLDrive is a lightweight, vision-augmented multimodal LLM (MLLM) for language-grounded autonomous driving. It cuts the parameter count from 7B to 1.3B while improving driving performance in the CARLA simulator.

Contribution

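The paper presents VLDrive, a lightweight MLLM architecture with enhanced vision components and a distance-decoupled instruction attention mechanism. It targets two weaknesses of existing LLM-based driving models: limited visual representations, which lead to collisions and obstructions, and large parameter counts, which hinder deployment.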

Findings

01. Achieves state-of-the-art driving performance in the CARLA simulator.

02. Reduces model parameters by 81%, from 7B to 1.3B.

03. Improves driving scores by up to 16.8% across the evaluated driving distances.
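
The 81% figure in the second finding follows directly from the two parameter counts quoted there. A minimal arithmetic check in Python, where the helper name `relative_reduction` and the variable names are illustrative and not taken from the paper:

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Parameter counts in billions, as quoted in the findings above.
baseline_params_b = 7.0
vldrive_params_b = 1.3

reduction = relative_reduction(baseline_params_b, vldrive_params_b)
print(f"{reduction:.1%}")  # prints 81.4%
```

The result, about 81.4%, is consistent with the reported 81% reduction.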

Abstract

Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameters of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning