VLMPlanner: Integrating Visual Language Models with Motion Planning

Zhipeng Tang; Sha Zhang; Jiajun Deng; Chenjie Wang; Guoliang You; Yuting Huang; Xinrui Lin; Yanyong Zhang

arXiv:2507.20342 · cs.AI · July 29, 2025

TL;DR

VLMPlanner integrates vision-language models with motion planning to enhance autonomous driving decision-making by utilizing detailed visual context and adaptive inference, leading to safer and more robust trajectories in complex environments.

Contribution

This work introduces VLMPlanner, a novel hybrid framework combining a real-time motion planner with a vision-language model and a dynamic inference mechanism for improved autonomous driving.

Findings

01 Outperforms existing methods on the nuPlan benchmark

02 Effectively captures detailed visual cues for better planning

03 Balances performance and efficiency with CAI-Gate

Abstract

Integrating large language models (LLMs) into autonomous driving motion planning has recently emerged as a promising direction, offering enhanced interpretability, better controllability, and improved generalization in rare and long-tail scenarios. However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. The VLM processes multi-view images to capture rich, detailed visual information and leverages its common-sense reasoning capabilities to guide the real-time planner in generating robust and…
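
To make the hybrid arrangement concrete, here is a minimal Python sketch of a fast planner that runs on every step while a slower VLM is consulted only when a gate decides the scene warrants it. This page does not describe how CAI-Gate actually makes that decision, so the gating rule here (a scene-complexity score compared against a threshold, plus a staleness limit) is an assumption for illustration only. Every name in the block (`HybridPlanner`, `fast_planner`, `vlm`, `complexity`, `threshold`, `max_stale_steps`) is hypothetical and does not come from the paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical trajectory format: a list of (x, y) waypoints.
Trajectory = List[Tuple[float, float]]


@dataclass
class HybridPlanner:
    """Illustrative gate around a slow VLM and a fast planner.

    All callables are stand-ins supplied by the caller; none of them
    is the paper's actual component.
    """

    fast_planner: Callable[[dict, Optional[str]], Trajectory]
    vlm: Callable[[Sequence[bytes]], str]
    complexity: Callable[[dict], float]  # assumed score in [0, 1]
    threshold: float = 0.5               # assumed gate threshold
    max_stale_steps: int = 10            # assumed refresh limit
    _guidance: Optional[str] = None
    _age: int = 0
    vlm_calls: int = 0

    def step(self, scene: dict, images: Sequence[bytes]) -> Trajectory:
        """Plan one step, refreshing VLM guidance only when the gate opens."""
        self._age += 1
        needs_refresh = (
            self._guidance is None                        # no guidance yet
            or self.complexity(scene) >= self.threshold   # scene looks hard
            or self._age >= self.max_stale_steps          # guidance is stale
        )
        if needs_refresh:
            self._guidance = self.vlm(images)  # slow path: reason over images
            self._age = 0
            self.vlm_calls += 1
        # Fast path always runs, conditioned on the latest guidance.
        return self.fast_planner(scene, self._guidance)
```

In this sketch the planner is called on every step, while `vlm_calls` grows only when the scene score crosses the threshold or the cached guidance ages out. That is one simple way to trade VLM cost against responsiveness, and it should not be read as the mechanism the authors implemented.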
