MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

Lingjun Zhang; Yujian Yuan; Changjie Wu; Xinyuan Chang; Xin Cai; Shuang Zeng; Linzhe Shi; Sijin Wang; Hang Zhang; Mu Xu

arXiv:2602.21952 · cs.CV · February 26, 2026

TL;DR

MindDriver introduces a progressive multimodal reasoning framework for autonomous driving, combining semantic understanding, scene imagination, and trajectory planning to improve decision-making in vision-language models.

Contribution

The paper proposes a progressive multimodal reasoning framework, supported by a feedback-guided data annotation pipeline and reinforcement fine-tuning, to address limitations of existing chain-of-thought methods in autonomous driving.

Findings

1. Outperforms existing methods in nuScenes open-loop evaluation.
2. Achieves superior results in Bench2Drive closed-loop tests.
3. Demonstrates effective alignment of multimodal reasoning processes.

Abstract

Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), the reasoning strategy most widely used with VLMs, faces critical challenges. Existing textual CoT leaves a large gap between the text semantic space and the physical trajectory space. Although a recent approach uses future images in place of text as the CoT process, it lacks clear planning-oriented objective guidance for generating images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables a VLM to imitate human-like progressive thinking for autonomous driving. MindDriver reasons in three stages: semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Generative Adversarial Networks and Image Synthesis