MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
Lingjun Zhang, Yujian Yuan, Changjie Wu, Xinyuan Chang, Xin Cai, Shuang Zeng, Linzhe Shi, Sijin Wang, Hang Zhang, Mu Xu

TL;DR
MindDriver introduces a progressive multimodal reasoning framework for autonomous driving, combining semantic understanding, scene imagination, and trajectory planning to improve decision-making in vision-language models.
Contribution
The paper proposes a framework supported by a feedback-guided data annotation pipeline and reinforcement fine-tuning, addressing limitations of existing chain-of-thought methods in autonomous driving.
Findings
Outperforms existing methods in nuScenes open-loop evaluation
Achieves superior results in Bench2Drive closed-loop tests
Demonstrates effective alignment of multimodal reasoning processes
Abstract
Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), a reasoning strategy widely used by VLMs, faces critical challenges. Existing textual CoT leaves a large gap between the text semantic space and the trajectory physical space. Although a recent approach uses future images in place of text as the CoT process, it lacks clear planning-oriented objective guidance for generating images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables a VLM to imitate human-like progressive thinking for autonomous driving. MindDriver performs semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Generative Adversarial Networks and Image Synthesis
