VLMPlanner: Integrating Visual Language Models with Motion Planning
Zhipeng Tang, Sha Zhang, Jiajun Deng, Chenjie Wang, Guoliang You, Yuting Huang, Xinrui Lin, Yanyong Zhang

TL;DR
VLMPlanner couples a vision-language model with a real-time motion planner so that autonomous driving decisions draw on detailed visual context, and uses adaptive inference to control how often the model is invoked, with the goal of safer and more robust trajectories in complex environments.
Contribution
This work introduces VLMPlanner, a novel hybrid framework combining a real-time motion planner with a vision-language model and a dynamic inference mechanism for improved autonomous driving.
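To make the idea of a hybrid planner with dynamic inference more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class `HybridPlanner`, the `complexity` score in [0, 1], the interval bounds, and the linear mapping from score to call interval are all assumptions introduced for illustration, and this page does not describe how the paper's own gating works. The sketch only shows the general pattern: a fast planner runs at every step, while a slower vision-language model is queried less often in simple scenes and more often in complex ones.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints


@dataclass
class HybridPlanner:
    """Hypothetical wrapper: fast planner every step, VLM only when due."""

    # All three callables are placeholders supplied by the caller.
    fast_planner: Callable[[dict, Optional[str]], Trajectory]
    vlm: Callable[[dict], str]          # returns textual guidance
    complexity: Callable[[dict], float]  # assumed scene score in [0, 1]
    min_interval: int = 1    # steps between VLM calls in the hardest scenes
    max_interval: int = 10   # steps between VLM calls in the easiest scenes
    vlm_calls: int = 0
    _steps_since_vlm: int = 0
    _guidance: Optional[str] = None

    def interval(self, score: float) -> int:
        """Map a complexity score to a call interval (assumed linear)."""
        score = min(max(score, 0.0), 1.0)
        span = self.max_interval - self.min_interval
        return round(self.max_interval - score * span)

    def step(self, obs: dict) -> Trajectory:
        """Plan one step, refreshing the VLM guidance only when it is due."""
        due = (
            self._guidance is None
            or self._steps_since_vlm >= self.interval(self.complexity(obs))
        )
        if due:
            self._guidance = self.vlm(obs)
            self._steps_since_vlm = 0
            self.vlm_calls += 1
        self._steps_since_vlm += 1
        # The fast planner always runs, conditioned on the latest guidance.
        return self.fast_planner(obs, self._guidance)
```

With stub callables, a scene scored 0.0 triggers one VLM call in the first ten steps, while a scene scored 1.0 triggers a call at every step. This is the kind of trade-off between planning quality and inference cost that an adaptive gate is meant to manage.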
Findings
Outperforms existing methods on the nuPlan benchmark
Captures detailed visual cues that improve planning
Balances performance and efficiency with CAI-Gate
Abstract
Integrating large language models (LLMs) into autonomous driving motion planning has recently emerged as a promising direction, offering enhanced interpretability, better controllability, and improved generalization in rare and long-tail scenarios. However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. The VLM processes multi-view images to capture rich, detailed visual information and leverages its common-sense reasoning capabilities to guide the real-time planner in generating robust and…
