DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

Zhe Liu; Runhui Huang; Rui Yang; Siming Yan; Zining Wang; Lu Hou; Di Lin; Xiang Bai; Hengshuang Zhao

arXiv:2512.12799·cs.CV·December 16, 2025

TL;DR

DrivePI introduces a spatial-aware 4D multi-modal large language model that unifies perception, prediction, and planning for autonomous driving, achieving state-of-the-art results with a compact model.

Contribution

It presents a novel unified framework that integrates 3D perception, spatial understanding, and action planning in a single end-to-end model for autonomous driving.

Findings

1. Outperforms existing VLA models on nuScenes-QA and achieves a lower collision rate than they do.
2. Surpasses specialized VA models in 3D occupancy and occupancy flow accuracy.
3. Achieves significant improvements in planning error metrics.

Abstract

Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application to generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework and is also compatible with Vision-Action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety