MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion

Minghui Hou; Wei-Hsing Huang; Shaofeng Liang; Daizong Liu; Tai-Hao Wen; Gang Wang; Runwei Guan; Weiping Ding

arXiv:2512.13177·cs.CV·December 17, 2025

Open Access

TL;DR

MMDrive is a novel multimodal framework that extends scene understanding in autonomous driving from 2D images to 3D by integrating occupancy maps, LiDAR, and text, significantly improving reasoning capabilities.

Contribution

MMDrive introduces two components, one for adaptive cross-modal fusion and one for key information extraction, enabling 3D scene understanding beyond traditional image-based models.

Findings

01 Achieves state-of-the-art performance on DriveLM and NuScenes-QA benchmarks.

02 Demonstrates significant improvements in autonomous driving scene understanding.

03 Enables robust multimodal reasoning in complex environments.

Abstract

Vision-language models enable understanding of and reasoning about complex traffic scenarios through multi-source information fusion, which has established them as a core technology for autonomous driving. However, existing vision-language models are constrained by a 2D image understanding paradigm, which restricts their ability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to generalized 3D scene understanding. MMDrive incorporates three complementary modalities: occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications